### 1.1 Opening Session: Plenary, Awards Ceremony & Keynote Addresses

**Date:** Tuesday 28 March 2017  
**Time:** 08:30 - 10:30  
**Location / Room:** Auditorium A  
**Chair:** David Atienza, EPFL, CH  
**Co-Chair:** Giorgio Di Natale, LIRMM, FR

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>08:30</td>
<td>1.1.1</td>
<td>WELCOME ADDRESSES</td>
<td>Speakers: David Atienza<sup>1</sup> and Giorgio Di Natale<sup>2</sup><br><sup>1</sup>DATE 2017 General Chair, EPFL, CH; <sup>2</sup>DATE 2017 Programme Chair, LIRMM, FR</td>
</tr>
<tr>
<td>08:45</td>
<td>1.1.2</td>
<td>PRESENTATION OF DISTINGUISHED AWARDS</td>
<td></td>
</tr>
</tbody>
</table>

KEYNOTE: DESIGN AUTOMATION IN THE ERA OF AI AND IOT: CHALLENGES AND PITFALLS

Speaker: Arvind Krishna, IBM Research, US

Abstract

The AI and IoT revolutions are twin phenomena that are reshaping business models, industries, and society. If we are to maximize their potential, we must overcome significant technical challenges with the help of the Design Automation and Test Community.

First, new computer architectures are required to accelerate solutions driven by cognitive computing, the term given to a comprehensive set of AI capabilities that includes not just machine learning but also data ingestion, data privacy, learning, reasoning, natural language, and conversation. These architectures must support each of these new technologies and manage extreme, cognitive workloads marked by unprecedented volumes of structured and unstructured data. This challenge poses important questions for the Design Automation and Test community about what new approaches can be taken.

A similar challenge is inherent in the rapid development of IoT, where the span of computing architecture varies from extremely low power constraints, limited bandwidth, and sporadic access at the “edge” of the network to the nearly infinite power and compute of data centers. This raises the question of how to maximize the design and placement of IoT systems, which will have to function for extended periods of time (up to ten years or more, like a pacemaker).

Unlike smartphones, these systems can’t simply be disposed of, which raises significant security concerns.

In his talk exploring these challenges, Dr. Krishna will emphasize that solutions can only come from an integrated hardware-software co-design approach. He will also highlight some of the leading-edge technologies IBM Research is developing to drive further innovation in the computing stack as the era governed by Moore's law comes to a close.

KEYNOTE: A NEW ERA OF HARDWARE MICROSERVICES IN THE CLOUD

Speaker: Doug Burger, Microsoft Research, US

Abstract

The Cloud is causing a major shift in both the business ecosystem and system infrastructures. The major hyperscale providers are building out highly interconnected, worldwide computers at a scale that allows them to make significant first-party investments. This verticalization allows them to make cross-layer architectural changes more rapidly than the old horizontal model would. A second trend is the emergence of ultra-low latency requirements in the Cloud, moving storage, networking, and services from the millisecond to the microsecond regime. In this talk, I will describe how these architectural shifts are enabling the emergence of specialized hardware in datacenters that enables services to be operated in the microsecond regime. On FPGAs, GPUs, and ASICs, these services can run with no CPU intervention, allowing much lower latencies and better cost structures than previously possible for key services such as deep learning. Over time, this transition will enable a much broader collection of hardware IP to run at scale in the Cloud.

UB01 Session 1

Date: Tuesday 28 March 2017
Time: 10:30 - 12:30
Location / Room: Booth 1, Exhibition Area

UB01.1 NOXIM-XT: A BIT-ACCURATE POWER ESTIMATION SIMULATOR FOR NOCS

Presenter: Pierre Bomel, Université de Bretagne Sud, FR
Authors: André Ross<sup>1</sup>, Johann Laurent<sup>2</sup> and Erwan Moreac<sup>2</sup>
<sup>1</sup>LERIA, Université d’Angers, Angers, FR; <sup>2</sup>Lab-STICC, Université de Bretagne Sud, Lorient, FR

Abstract: We have developed an enhanced version of Noxim (Noxim-XT) to estimate the energy consumption of a NoC in a SoC. Noxim-XT is used in a two-step methodology. First, applications are mapped onto a SoC and their traffic is extracted by simulation with MPSoCbench. Second, Noxim-XT tests various hardware configurations of the NoC: for each configuration, the application’s traffic is re-injected and replayed, an accurate performance and power breakdown is provided, and the user can choose among different data coding strategies. With Noxim-XT, the energy consumption of each configuration is estimated bit-accurately. After simulation, a spatial map of the energy consumption highlights the hot spots. Moreover, the new coding strategies allow significant energy savings. Noxim-XT simulations and an FPGA-based prototype of a new coding strategy will be demonstrated at the U-booth to illustrate this work.

TFA: TRANSPARENT CODE OFFLOADING ON FPGA

Presenter: Roberto Rigamonti, HEIG-VD/HES-SO, CH
Authors: Anthony Convers, Baptiste Delporte, Xavier Ruppen and Alberto Dassatti, HEIG-VD/HES-SO, CH

Abstract: Genomics, molecular dynamics, and machine learning are just the most recent examples of fields where FPGAs could provide the means to achieve interesting breakthroughs. However, HDL programming requires considerable multi-disciplinary skills, experience, large budgets, time, and a bit of wizardry. Given that most implementations are short-lived, the investment simply does not pay off. In this demo we propose a multi-vendor LLVM-based automated framework that can transparently - without the user or developer being aware of it - offload computing-intensive code fragments to FPGAs. The system relies on a performance monitor to detect computing-intensive code sections and, if they are suitable for offloading, extracts the Data Flow Graph and uses it to program an overlay pre-programmed on the FPGA, which then interacts with the Just-In-Time compiler executing the program. The overall process requires hundreds of microseconds, and can be easily reverted should the outcome be unsatisfactory.


DEMONSTRATION OF HW/SW CO-PROCESSING WITH FPGA FOR FAST VISUAL NAVIGATION OF ROVERS

Presenter: Konstantinos Maragos, National Technical University of Athens, GR
Authors: George Lentaris and Dimitrios Soudris, National Technical University of Athens, GR

Abstract: Autonomy, speed and accuracy are vital factors for successful rover exploration missions. However, the extremely low performance of on-board space-grade CPUs, in conjunction with the increased complexity of sophisticated computer vision algorithms, becomes a serious bottleneck for fast rover navigation. In this work, we present an FPGA-based HW/SW co-design solution to accelerate visual odometry algorithms tailored to the needs of future Mars exploration missions being scheduled by the European Space Agency. For demonstration purposes, we use a Xilinx Kintex-7 FPGA to process images and perform feature detection, description, and matching. The FPGA communicates via an Ethernet port with the host CPU, which performs filtering and egomotion estimation with absolute orientation. We present the navigation path of a hypothetical moving rover that successively processes stored images acquired on a hypothetical Martian surface, while live-recording the CPU-FPGA co-processing.


MULTI-CORE VERIFICATION: COMBINING MICROTESK AND SPIN FOR VERIFICATION OF MULTI-CORE MICROPROCESSORS

Presenter: Mikhail Chupilko, ISPRAS, RU
Authors: Alexander Kamkin, Mikhail Lebedev and Andrei Tatarnikov, ISPRAS, RU

Abstract: The complexity of modern cache coherence protocols (CCP) in multi-core microprocessors prevents complete verification of shared memory subsystems by means of random test-program generators (TPG). Two steps are suggested to address the problem. The first step is to specify CCP features separately and generate CCP-specific events to be used by the TPG when generating a test program (TP). The protocol is specified in Promela, and Spin produces a test template (TT). Spin also produces a UVM (or C+TESK) testbench that makes the execution of the resulting TP controllable and deterministic. The second step is to let the TPG produce the memory access instructions that cause the desired CCP-specific behavior. As the TPG we use MicroTESK, whose Ruby-based TTs abstractly describe future TPs. MicroTESK processes these TTs, producing a TP with CCP-specific events. The resulting TP is executed together with the testbench to reproduce exactly the situation that Spin found to be important for the protocol.


A VOLTAGE-SCALABLE FULLY DIGITAL ON-CHIP MEMORY FOR ULTRA-LOW-POWER IOT PROCESSORS

Presenter: Jun Shiomi, Kyoto University, JP
Authors: Tohru Ishihara and Hidetoshi Onodera, Kyoto University, JP

Abstract: A voltage-scalable RISC processor integrating standard-cell based memory (SCM) is demonstrated. Unlike conventional processors, it uses SCMs instead of conventional SRAM macros, enabling it to operate at a single 0.4 V supply voltage. The processor is implemented with a fully automated cell-based design flow, which leads to low design costs. By scaling the supply voltage and applying back-gate biasing techniques, the power dissipation of the SCMs is kept below 20 μW, enabling the SCMs to operate from an ambient energy source alone. In this demonstration, the SCMs of the processor operate with a lemon battery as the ambient energy source.


MARGOT: APPLICATION ADAPTATION THROUGH RUNTIME AUTOTUNING

Presenter: Gianluca Palermo, Politecnico di Milano, IT
Authors: Davide Gadioli, Emanuele Vitali and Cristina Silvano, Politecnico di Milano, IT

Abstract: Several classes of applications expose parameters that influence their extra-functional properties, such as the quality of the result or the performance. The application designer must therefore tune these parameters to find the configuration that produces the desired outcome. Given that the application requirements and the resources assigned to each application might vary at runtime, finding a one-size-fits-all configuration is not a trivial task. For this reason, we implemented the mARGOt framework, which enhances an application with an adaptation layer that continuously tunes the parameters according to the evolving situation. In more detail, mARGOt is composed of a monitoring infrastructure, an application-level adaptation engine and an extra-functional configuration framework based on the separation of concerns between functional and extra-functional aspects. At the booth, we plan to demonstrate the effectiveness of the proposed infrastructure on three real-life applications.


XBARGEN: A TOOL FOR DESIGN SPACE EXPLORATION OF MEMRISTOR BASED CROSSBAR ARCHITECTURES

Presenter: Marcello Traiola, LIRMM, FR
Authors: Mario Barbaresci<sup>1</sup> and Alberto Bosio<sup>2</sup>
<sup>1</sup>University of Naples Federico II, IT; <sup>2</sup>University of Montpellier - LIRMM laboratories, FR

Abstract: The unceasing shrinking of CMOS technology is approaching its physical limits, impacting several aspects such as performance and power consumption. Alternative solutions are under investigation in order to overcome CMOS limitations. Among them, the memristor is one of the promising technologies. Several works have been proposed so far, describing how to synthesize Boolean logic functions on memristor-based crossbar architectures. However, depending on the synthesis parameters, different architectures can be obtained. In this demo, we show a Design Space Exploration (DSE) tool that we use to select the best crossbar configuration on the basis of workload-dependent and workload-independent parameters, such as area, time and power consumption. The main advantage is that it does not require any simulation and thus avoids any runtime overhead. The demo aims to show the tool prototype on a selected set of benchmarks, which will be synthesized on a memristor-based crossbar circuit.

2.1 Executive Panel: The Electronics Innovation Landscape: Opportunities, Challenges and Strategies

Date: Tuesday 28 March 2017  
Time: 11:30 - 13:00  
Location / Room: Auditorium A

Chair: Alberto Sangiovanni-Vincentelli, UCB, US

From autonomous driving to big data, from machine learning to cyber-physical systems, from robotics to the internet of everything, from brain-machine interfaces to the human intranet, innovation is moving at a pace that has never been seen before. To face the large investments and increasing global competition, mergers and acquisitions have sped up in all areas, including the semiconductor industry, which has possibly been the most defining enabling factor of these disruptive technologies. The panel will address the structural factors needed to sustain innovation and the strategies that some of the important actors in the industrial and research sectors are embracing. The panel will also address the opportunities and difficulties of the different regions of the world in the changing social and economic landscape. The panel will begin with an introductory presentation about the state of technology and innovation in the areas outlined above. Then executives from IBM, STMicroelectronics and Leti will address the problems to face and the strategies to embrace in a challenging competitive landscape.

Panelists:
- Arvind Krishna, Sr. VP, Head of Research, IBM, US
- Marie-Noëlle Semeria, CEO, CEA/Leti, FR
- Benedetto Vigna, EVP & GM, Analog & MEMS Group, STMicroelectronics, IT

13:00 End of session

Lunch Break in the Garden Foyer  

Keynote Lecture session 3.0 in the Garden Foyer, 13:50 - 14:20

On all conference days (Tuesday to Thursday), a buffet lunch will be offered in the Garden Foyer, in front of the session rooms. Kindly note that this is restricted to conference delegates holding a lunch voucher. When entering the lunch area, delegates will be asked to present the lunch voucher for that day. Once a delegate has left the lunch area, re-entry is not allowed for that lunch.

2.2 Stochastic, Approximate and Neural Computing

Date: Tuesday 28 March 2017  
Time: 11:30 - 13:00  
Location / Room: 4BC

Stochastic and approximate computing is an approach developed to improve the energy efficiency of computer hardware. The first paper presents a framework for quantifying and managing accuracy in stochastic circuit design. The second paper deals with a new approximate multiplier design. Energy-efficient hybrid stochastic-binary neural networks are proposed in the third paper. The last paper addresses a new retraining method that improves fault tolerance in RRAM crossbars.

### Abstract

Stochastic circuits (SCs) offer tremendous area and power-consumption benefits at the expense of computational inaccuracies. Managing accuracy is a central problem in SC design and has no counterpart in conventional circuit synthesis. It raises a basic question: how to build a systematic design flow for stochastic circuits? We present, for the first time, a systematic design approach to control the accuracy of SCs and balance it against other design parameters. We express the (in)accuracy of a circuit processing n-bit stochastic numbers by the numerical deviation of the computed value from the expected result, in conjunction with a confidence level. Using the theory of Monte Carlo simulation, we derive expressions for the stochastic number length required for a desired level of accuracy, or vice versa. We discuss the integration of the theory into a design framework that is applicable to both combinational and sequential SCs. We show that, for combinational SCs, accuracy is independent of the circuit's size or complexity, a surprising result. We also show how the analysis can identify subtle errors in both combinational and sequential designs.
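
The relation between stochastic number length, numerical deviation and confidence level can be illustrated with a small sketch. It is not the paper's derivation: it assumes independent, unbiased bit-streams and uses Hoeffding's inequality as a stand-in bound, and the function names `required_length`, `bitstream` and `sc_multiply` are hypothetical.

```python
import math
import random


def required_length(eps, confidence):
    """Smallest N with 2*exp(-2*N*eps^2) <= 1 - confidence.

    Assumption: Hoeffding's inequality for N independent bits; the paper
    derives its own expressions, which are not reproduced here.
    """
    delta = 1.0 - confidence
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def bitstream(p, n, rng):
    """Encode the value p in [0, 1] as n independent random bits."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def sc_multiply(p, q, n, rng):
    """Stochastic multiplication: bitwise AND of two independent streams."""
    a = bitstream(p, n, rng)
    b = bitstream(q, n, rng)
    return sum(x & y for x, y in zip(a, b)) / n


if __name__ == "__main__":
    n = required_length(eps=0.05, confidence=0.95)
    rng = random.Random(2017)
    print(n, sc_multiply(0.5, 0.5, n, rng))
```

Under this bound, a deviation of at most 0.05 with 95% confidence needs 738 bits, and tightening the deviation to 0.01 raises the length by roughly a factor of 25. The bound depends only on the output stream, which is in line with the abstract's remark that accuracy of combinational SCs does not depend on circuit size.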

### Energy-Efficient Approximate Multiplier Design Using Bit Significance-Driven Logic Compression

**Presentation Title**: ENERGY-EFFICIENT APPROXIMATE MULTIPLIER DESIGN USING BIT SIGNIFICANCE-DRIVEN LOGIC COMPRESSION

**Speaker**: Issa Qiqieh, School of Electrical and Electronic Engineering, Newcastle University, GB

**Authors**: Issa Qiqieh, Rishad Shafik, Ghaith Tarawneh, Dani Sokolov and Alex Yakovlev, Newcastle University, GB

**Abstract**

Approximate arithmetic has recently emerged as a promising paradigm for many imprecision-tolerant applications. It can offer substantial reductions in circuit complexity, delay and energy consumption by relaxing accuracy requirements. In this paper, we propose a novel energy-efficient approximate multiplier design using a significance-driven logic compression (SDLC) approach. Fundamental to this approach is an algorithmic and configurable lossy compression of the partial product rows based on their progressive bit significance. This is followed by the commutative remapping of the resulting product terms to reduce the number of product rows. As such, the complexity of the multiplier in terms of logic cell counts and lengths of critical paths is drastically reduced. A number of multipliers with different bit-widths (4-bit to 128-bit) are designed in SystemVerilog and synthesized using Synopsys Design Compiler. Post-synthesis experiments showed that up to an order of magnitude energy savings, and reductions of 65% in critical delay and almost 45% in silicon area can be achieved for a 128-bit multiplier compared to an accurate equivalent. These gains are achieved with low accuracy losses estimated at less than 0.00071 mean relative error. Additionally, we demonstrate the energy-accuracy trade-offs for different degrees of compression, achieved through configurable logic clustering. In evaluating the effectiveness of our approach, a case study image processing application showed up to 68.3% energy reduction with negligible losses in image quality expressed as peak signal-to-noise ratio (PSNR).


### Energy-Efficient Hybrid Stochastic-Binary Neural Networks for Near-Sensor Computing

**Presentation Title**: ENERGY-EFFICIENT HYBRID STOCHASTIC-BINARY NEURAL NETWORKS FOR NEAR-SENSOR COMPUTING

**Speaker**: Vincent Lee, University of Washington, US

**Authors**: Vincent Lee, Armin Alaghi, John Hayes, Visvesh Sathe and Luis Ceze

**Abstract**

Recent advances in neural networks (NNs) exhibit unprecedented success at transforming large, unstructured data streams into compact higher-level semantic information for tasks such as handwriting recognition, image classification, and speech recognition. Ideally, systems would employ near-sensor computation to execute these tasks at sensor endpoints to maximize data reduction and minimize data movement. However, near-sensor computing presents its own set of challenges such as operating power constraints, energy budgets, and communication bandwidth capacities. In this paper, we propose a stochastic-binary hybrid design which splits the computation between the stochastic and binary domains for near-sensor NN applications. In addition, our design uses a new stochastic adder and multiplier that are significantly more accurate than existing adders and multipliers. We also show that retraining the binary portion of the NN computation can compensate for precision losses introduced by shorter stochastic bit-streams, allowing faster run times at minimal accuracy losses. Our evaluation shows that our hybrid stochastic-binary design can achieve 9.8× energy efficiency savings, and application-level accuracies within 0.05% compared to conventional all-binary designs.


### Accelerator-Friendly Neural-Network Training: Learning Variations and Defects in RRAM Crossbar

**Presentation Title**: ACCELERATOR-FRIENDLY NEURAL-NETWORK TRAINING: LEARNING VARIATIONS AND DEFECTS IN RRAM CROSSBAR

**Speaker**: Li Jiang, Shanghai Jiao Tong University, CN

**Authors**: Lerong Chen, Jiaowen Li, Yiran Chen, Qiuping Deng, Jiyuan Shen and Li Jiang

**Abstract**

An RRAM crossbar consisting of memristor devices can naturally carry out matrix-vector multiplication; it has thereby gained great momentum as a highly energy-efficient accelerator for neuromorphic computing. The resistance variations and stuck-at faults in the memristor devices, however, dramatically degrade not only the chip yield, but also the classification accuracy of the neural networks running on the RRAM crossbar. Existing hardware-based solutions cause enormous overhead and power consumption, while software-based solutions are less efficient in tolerating stuck-at faults and large variations. In this paper, we propose an accelerator-friendly neural-network training method that leverages the inherent self-healing capability of the neural network to prevent large-weight synapses from being mapped to abnormal memristors, based on the fault/variation distribution in the RRAM crossbar. Experimental results show that the proposed method can pull the classification accuracy (10%-45% loss in previous works) up close to the ideal level, with ≤ 1% loss.

2.3 Cache memory management for performance and reliability

Date: Tuesday 28 March 2017

Time: 11:30 - 13:00

Location / Room: 2BC

Chair:

Dionisios Pnevmatikatos, Technical University of Crete, GR

Co-Chair:

Cristina Silvano, Politecnico di Milano, IT

Cache memory design optimizations and management can have a significant effect on cost, performance, and reliability. The first paper proposes an asymmetric cache management policy for GPGPUs with hybrid main memories that significantly improve performance for memory intensive workloads. The second paper targets the optimization of the bank placement in GPUs' last level cache, with the goal of maximizing the performance of the GPU's on-chip network. The third paper proposes a methodology for jointly analyzing all the cache level configurations to determine and minimize the susceptibility of the caches to soft errors.

SOFT ERROR-AWARE ARCHITECTURAL EXPLORATION FOR DESIGNING RELIABILITY ADAPTIVE CACHE HIERARCHIES IN MULTI-CORES

Abstract
Mainstream multi-core processors employ large multi-level on-chip caches making them highly susceptible to soft errors. We demonstrate that designing a reliable cache hierarchy requires understanding the vulnerability interdependencies across different cache levels. This involves vulnerability analyses depending upon the parameters of different cache levels (partition size, line size, etc.) and the corresponding cache access patterns for different applications. This paper presents a novel soft error-aware cache architectural space exploration methodology and vulnerability analysis of multi-level caches considering their vulnerability interdependencies. Our technique significantly reduces exploration time while providing reliability-efficient cache configurations. We also show applicability/benefits for ECC-protected caches under multiple-bit fault scenarios.


SHARED LAST-LEVEL CACHE MANAGEMENT FOR GPGPUS WITH HYBRID MAIN MEMORY

Abstract
Memory-intensive workloads are becoming increasingly popular on general purpose graphics processing units (GPGPUs), and impose great challenges on the GPGPU memory subsystem design. On the other hand, with the recent development of non-volatile memory (NVM) technologies, hybrid memory combining both DRAM and NVM achieves high performance, low power and high density simultaneously, which provides a promising main memory design for GPGPUs. In this work, we explore shared last-level cache management for GPGPUs with consideration of the underlying hybrid main memory. In order to improve the overall memory subsystem performance, we exploit both the asymmetric read/write latency of the hybrid main memory architecture and the memory coalescing feature of GPGPUs. In particular, to reduce the average cost of L2 cache misses, we prioritize cache blocks from DRAM or NVM based on the observation that operations to the NVM part of main memory have a large impact on system performance. Furthermore, the cache management scheme also integrates GPU memory coalescing and cache bypassing techniques to improve the overall cache hit ratio. Experimental results show that in the context of a hybrid main memory system, our proposed L2 cache management policy improves performance against the traditional LRU policy and a state-of-the-art GPU cache strategy EABP [20] by up to 27.76% and 14%, respectively.


DROOP MITIGATING LAST LEVEL CACHE ARCHITECTURE FOR STTRAM

Abstract
Spin-Transfer Torque magnetic Random Access Memory (STT-RAM) is one of the emerging technologies in the domain of non-volatile dense memories, especially preferred for the last level cache (LLC). The amount of current needed to reorient the magnetization at present (~100 μA per bit) is too high, especially for the write operation. When we perform a full cache line (512-bit) write, this extremely high current compared to MRAM will result in a voltage droop in the conventional cache architecture. Due to this droop, the write operation will fail halfway through when we attempt to write to the bank of the cache farthest from the supply. In this paper, we propose a new cache architecture to mitigate this droop problem and make the write operation successful. Instead of writing the entire cache line (512 bits) contiguously in a single bank, our architecture writes these 512 bits to multiple different locations across the cache in 8 parts (64 bits each). Circuit and micro-architectural simulation results comparing our proposed architecture against the conventional one are presented in detail.
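
The idea of splitting one 512-bit line write into eight 64-bit parts can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the parts are placed, so the bank count `NUM_BANKS` and the rotating round-robin placement in `place_chunks` are hypothetical choices, as are all function names.

```python
LINE_BITS = 512   # cache line width, from the abstract
CHUNK_BITS = 64   # size of each part, from the abstract
NUM_BANKS = 8     # assumption: one bank per part


def split_line(line):
    """Split a 512-bit integer into eight 64-bit chunks, least significant first."""
    mask = (1 << CHUNK_BITS) - 1
    count = LINE_BITS // CHUNK_BITS
    return [(line >> (i * CHUNK_BITS)) & mask for i in range(count)]


def place_chunks(line_index, chunks):
    """Assign each chunk to a bank (assumed round-robin, rotated per line),
    so no single bank has to supply the write current for the whole line."""
    return [((line_index + i) % NUM_BANKS, chunk) for i, chunk in enumerate(chunks)]


def reassemble(chunks):
    """Rebuild the original 512-bit line from its chunks on a read."""
    line = 0
    for i, chunk in enumerate(chunks):
        line |= chunk << (i * CHUNK_BITS)
    return line


if __name__ == "__main__":
    line = int.from_bytes(bytes(range(64)), "little")
    placement = place_chunks(line_index=3, chunks=split_line(line))
    print([bank for bank, _ in placement])
    print(reassemble([chunk for _, chunk in placement]) == line)
```

With this placement each bank sees at most 64 simultaneous bit writes per line instead of 512. Whether the paper's architecture distributes the parts in this way is not stated in the abstract.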

2.5 Reliability and Energy-Efficiency: Two Pillars of NoC Design

Date: Tuesday 28 March 2017
Time: 11:30 - 13:00
Location / Room: 3C

Chair: Sebastien Le Beux, Ecole Centrale de Lyon, FR

Early performance and power estimation is critical for computer system design. This session covers novel analytical and semi-analytical approaches for fast and accurate modeling of different system components, including GPUs, DRAMs and caches.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>11:30</td>
<td>2.4.1</td>
<td>GATSIM: ABSTRACT TIMING SIMULATION OF GPUS</td>
<td>Speaker: Andreas Gerstlauer, The University of Texas at Austin, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Authors: Kishore Punniyamurthy, Behzad Boroujerdi and Andreas Gerstlauer, The University of Texas at Austin, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Abstract: General-Purpose Graphics Processing Units (GPUs) have become an integral part of heterogeneous system architectures. Ever-increasing complexity has made rapid, early performance evaluation of GPU-based architectures and applications a primary design concern. Traditional cycle-accurate GPU simulators are too slow, while existing analytical or source-level estimation approaches are often inaccurate. This paper proposes a novel abstract GPU performance simulation approach that is based on flexible separation of functional and timing models, combining fast functional execution, either on existing simulators or on native GPU hardware, with a light, fast and accurate abstract timing model. Micro-architecture timing of individual GPU cores is abstracted through static, one-time pre-characterization of code, and only the dynamic scheduling effects are simulated. Using a native GPU for functional execution and excluding pre-characterization, our GPU simulation achieves a throughput of more than 80 MIPS. This is on average 400x faster with 4% error compared to a cycle-accurate GPU simulator for standard GPU benchmarks. Moreover, our simple timing model provides flexibility to target different GPU configurations with little or no extra effort.</td>
</tr>
<tr>
<td>12:00</td>
<td>2.4.2</td>
<td>MESAP: A FAST ANALYTIC POWER MODEL FOR DRAM MEMORIES</td>
<td>Speaker: Sandeep Poddar, IBM Research, The Netherlands, NL</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Authors: Sandeep Poddar<sup>1</sup>, Rik Jongerius<sup>1</sup>, Leandro Fiorin<sup>1</sup>, Giovanni Mariani<sup>2</sup>, Gero Dittmann<sup>2</sup>, Andreea Anghel<sup>2</sup> and Henk Corporaal<sup>3</sup></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Abstract: The design of an energy-efficient memory subsystem is one of the key issues that system architects face today. To achieve this goal, architects usually rely on system simulators and trace-based DRAM power models. However, their long execution time makes the approach infeasible for the design-space exploration of next-generation exascale computing systems. Analytic models, in contrast, are orders of magnitude faster. In this paper, we propose a new analytic memory scheduler-agnostic power model (Mesap) for DRAM. Our model achieves an average error of 20% for DDR3 and DDR4 memory systems, similar to a state-of-the-art trace-based approach, but our analytic model is an order of magnitude faster. Furthermore, we integrate Mesap into an analytic performance model of general-purpose processors and show its applicability to the design of a computing system targeting scientific image processing applications.</td>
</tr>
<tr>
<td>12:30</td>
<td>2.4.3</td>
<td>AFEC: AN ANALYTICAL FRAMEWORK FOR EVALUATING CACHE PERFORMANCE IN OUT-OF-ORDER PROCESSORS</td>
<td>Speaker: Kecheng Ji, Southeast University, CN</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Authors: Kecheng Ji<sup>1</sup>, Ming Ling<sup>1</sup>, Qin Wang<sup>1</sup>, Longxing Shi<sup>2</sup> and Jiaping Pan<sup>2</sup></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Abstract: Evaluating cache performance is becoming critically important for predicting the overall performance of out-of-order processors. Non-blocking caches, which are very common in out-of-order CPUs, can reduce the average cache miss penalty by overlapping multiple outstanding memory requests and merging different cache misses with the same cacheline address into one memory request. Normally, memory-level parallelism (MLP) has been used as a metric to describe the concurrency of memory accesses. Unfortunately, due to the extremely dynamic dependencies among the program memory references, it is very difficult to quantify MLP without time-consuming simulations. Moreover, the merging of multiple cache misses, which makes the average cache miss service time less than the physical DDR access latency, is seldom considered in existing research. In this paper, we propose a cache performance evaluation framework based on program trace analysis and analytical models to quickly estimate MLP and the effective cache miss service time without simulations. Compared with the results of Gem5 simulations of MobyBench 2.0, MiBench 1.0 and MediaBench II, the average accuracy of the modeled MLP and the average cache miss service time is higher than 91% and 92%, respectively. Combined with cache misses calculated by the stack distance theory, the average absolute error of CPU stall time (due to cache misses) is lower than 10%, while the evaluation time can be sped up by 35 times relative to full Gem5 simulations.</td>
</tr>
<tr>
<td>13:00</td>
<td>88</td>
<td>MODELING INSTRUCTION CACHE AND INSTRUCTION BUFFER FOR PERFORMANCE ESTIMATION OF VLIW ARCHITECTURES USING NATIVE SIMULATION</td>
<td>Speaker: Omayma Matoussi, Grenoble INP, TIMA laboratory, FR</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Authors: Omayma Matoussi<sup>1</sup> and Frédéric Pétrot<sup>2</sup></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td><sup>1</sup>TIMA Laboratory at Grenoble, FR; <sup>2</sup>TIMA Laboratory, Grenoble Institute of Technology, FR</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Abstract: In this work, we propose an icache performance estimation approach that focuses on a component necessary to handle instruction parallelism in a very long instruction word (VLIW) processor: the instruction buffer (IB). Our annotation approach is founded on an intermediate-level native-simulation framework. It is evaluated against a cycle-accurate instruction set simulator, yielding an average cycle count error of 9.3% and an average speedup of 10.</td>
</tr>
<tr>
<td>13:30</td>
<td></td>
<td>End of session</td>
<td>Lunch Break in Garden Foyer</td>
</tr>
</tbody>
</table>

### 2.5 Reliability and Energy-Efficiency: Two Pillars of NoC Design

This session addresses challenges related to the energy efficiency and reliability of NoCs. The first paper proposes an analytical approach to evaluate the reliability of adaptive routing algorithms. The second paper proposes an online monitoring and routing approach to address aging-related degradation in electrical NoCs. Finally, the third paper shows how to use network traffic-aware spatial parallelism to improve the energy efficiency of the Epiphany SoC.

### 2.5.1 Reliability Assessment of Fault Tolerant Routing Algorithms in Networks-on-Chip: An Analytic Approach

**Speaker:** Sadia Moriam, Technische Universität Dresden, DE

**Authors:** Sadia Moriam and Gerhard Fettweis, Technische Universität Dresden, DE

**Abstract:** Rapid scaling of transistor gate sizes has significantly increased the density of on-chip integration and paved the way for many-core systems-on-chip with greatly improved performance. The design of the interconnection network of these complex systems is critical, and the network-on-chip is now the accepted efficient interconnect for such large core arrays. An unfortunate side effect of technology scaling is the increased susceptibility to failures, resulting in failing links and routers in the network-on-chip. To keep the network connected, efficient fault-adaptive routing algorithms are necessary to route around faults. To design and evaluate the fault resiliency of such adaptive routing algorithms, fast, accurate and flexible analytic models are required, especially in large networks for which simulations are extremely time-consuming. In this paper, we present an analytic approach to evaluate the reliability of adaptive routing algorithms based on algebraic manipulations of the channel dependency matrix. It also allows the number of alternate paths between source-destination pairs to be evaluated in the presence of any number of permanent faults in the network. The analytic model is general and can be adapted to evaluate network reliability for any network topology and with any adaptive routing algorithm based on the turn model. We present cycle-accurate simulations to assess the accuracy of the model for the 2-D mesh and hexagonal networks. The model is able to estimate the network fault resilience with an accuracy of about 1% and more than 70 times faster than the cycle-accurate simulation.


### 2.5.2 Online Monitoring and Adaptive Routing for Aging Mitigation in NoCs

**Speaker:** Nader Bagherzadeh, University of California, Irvine, US

**Authors:** Zana Ghaderi, Ayed Alqahtani and Nader Bagherzadeh, University of California, Irvine, US

**Abstract:** The scalability of Network-on-Chip (NoC) as a promising solution for many-core systems can be jeopardized by reliability challenges such as aging in advanced silicon technology. Previous mitigation techniques to protect NoCs are either offline, while aging is strictly influenced by runtime operating conditions, or impose significant overheads on the system. This paper presents an online monitoring method through a Centralized Aging Table (CAT) for routers in NoCs. A router's capacity in flits, which are the main stimuli in routers, is predictable and limited for a given period of time. Consequently, stress rate and temperature, which are the major sources of aging mechanisms such as Bias Temperature Instability (BTI) and Hot Carrier Injection (HCI), will be in predictable ranges as well. Hence, our methodology uses a CAT, which is populated by values that represent the aging degradation for each pair of stress and temperature ranges during a given period of time. Furthermore, utilizing the CAT, we propose an online adaptive aging-aware routing algorithm that avoids highly aged routers, which eventually leads to age balancing between routers. Additionally, our proposed routing algorithm reduces the maximum age of routers by adaptively changing the shortest paths between source-destination pairs, considering the ages of the routers along them in each given period of time. Extensive experimental analysis using the gem5 simulator demonstrates that our online routing algorithm and monitoring methodology, CAT, improve the delay degradation of the most aged router and the aging imbalance on average by 39% and 52%, respectively, compared to XY routing. The impact of our proposed methodology on network latency, Energy-Delay-Product (EDP) and link utilization is negligible.


### 2.5.3 EBSP: Managing NoC Traffic for BSP Workloads on the 16-Core Adapteva Epiphany-III Processor

**Authors:** Siddhartha<sup>1</sup> and Nachiket Kapre<sup>2</sup>

<sup>1</sup>Nanyang Technological University, SG; <sup>2</sup>University of Waterloo, CA

**Abstract:** We can deliver high-performance and energy-efficient operation on the multi-core NoC-based Adapteva Epiphany-III SoC for bulk-synchronous workloads using our proposed EBSP communication API. We characterize and automate performance tuning of spatial parallelism for supporting (1) random access load-store style traffic suitable for irregular sparse computations, as well as (2) variable, data-dependent traffic patterns in neural networks or PageRank-style workloads in a manner tailored for the Epiphany NoC. We aggressively optimize traffic by exposing spatial communication structure to the fabric through offline pre-computation of destination addresses, unrolling of message-passing loops, selective squeezing of messages, and careful ordering of communication and compute. Using our approach, across a range of applications and datasets such as Sparse Matrix-Vector multiplication (Matrix Market datasets), PageRank (BerkStan SNAP dataset), and Izhikevich spiking neural evaluation, we deliver speedups of 6.5-10× while lowering power use by 2× over optimized ARM-based mappings. When compared to optimized OpenMP x86 mappings, we observe an 11-31× improvement in energy efficiency (GFLOP/s/W) for the Epiphany SoC. Epiphany is also able to beat state-of-the-art spatial FPGA (ZC706) and embedded GPU (Jetson TK1) mappings due to our communication optimizations. Our library is open-source and available at github.com/sidmont/ebsp.git.


### 2.6 Advancing Test for Mixed-Signal and Microfluidic Circuits and Systems

**Date:** Tuesday 28 March 2017

**Time:** 11:30 - 13:00

**Location / Room:** SA

**Chair:** Andre Ivanov, Univ. BC, CA

**Co-Chair:** Marie-Minerve Louërat, Univ. Pierre et Marie Curie, FR

Papers in this session discuss the latest advances and methodologies for test, including the application of machine learning and sensitivity analysis to mixed-signal circuits, and also present novel solutions for the test of microfluidic systems.

**Abstract:** Testing analog, mixed-signal and RF circuits represents the main cost component of testing complex SoCs. A promising solution to alleviate this cost is the machine learning-based test strategy. These test techniques are an indirect test approach that replaces costly specification measurements with simpler signatures. Machine learning algorithms are used to map these signatures to the performance parameters. Although this approach has a number of undeniable advantages, it also raises new issues that have to be addressed before it can be widely adopted by the industry. In this paper we present a machine learning-based test for a complex mixed-signal system - i.e. a state-of-the-art pipeline ADC - that includes digital calibration. This paper shows how the introduction of digital calibration for the ADC has a serious impact on the proposed test, as calibration completely decorrelates signatures from the target specification in the presence of local mismatch.

CZ; <sup>1</sup>Norwegian University of Science and Technology, NO; <sup>2</sup>Technische Universität München, DE; <sup>3</sup>Technische Universität Dresden, DE; <sup>4</sup>IT4Innovations, Ostrava, CZ; <sup>5</sup>Universität Stuttgart, DE

Abstract
In both the embedded systems and High Performance Computing domains, energy-efficiency has become one of the main design criteria. Efficiently utilizing the resources provided in computing systems ranging from embedded systems to current petascale and future Exascale HPC systems will be a challenging task. Suboptimal designs can potentially cause large amounts of underutilized resources and wasted energy. In both domains, a promising potential for improving efficiency of scalable applications stems from the significant degree of dynamic behaviour, e.g., runtime alternation in application resource requirements and workloads. Manually detecting and leveraging this dynamism to improve performance and energy-efficiency is a tedious task that is commonly neglected by developers. However, using an automatic optimization approach, application dynamism can be analysed at design time and used to optimize system configurations at runtime. The European Union Horizon 2020 READEX (Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing) project will develop a tools-aided auto-tuning methodology inspired by the system scenario methodology used in embedded systems. Dynamic behaviour of HPC applications will be exploited to achieve improved energy-efficiency and performance. Driven by a consortium of European experts from academia, HPC resource providers, and industry, the READEX project aims at developing the first of its kind generic framework to split design time and runtime automatic tuning while targeting heterogeneous system at the Exascale level. This paper describes plans for the project as well as early results achieved during its first year. Furthermore, it is shown how project results will be brought back into the embedded systems domain.

13:00

**BASTION: BOARD AND SOC TEST INSTRUMENTATION FOR AGEING AND NO FAILURE FOUND**

Speaker: Matteo Sonza Reorda, Politecnico di Torino, IT

Authors: Erik Larsson<sup>1</sup>, Matteo Sonza Reorda<sup>2</sup>, Maksim Jenihhin<sup>3</sup>, Jaan Raik<sup>4</sup>, Hans Kerkhoff<sup>5</sup>, Rene Krenz-Baath<sup>6</sup> and Piet Engelke<sup>7</sup>

<sup>1</sup>Lund University, SE; <sup>2</sup>Politecnico di Torino - DAUIN, IT; <sup>3</sup>Tallinn University of Technology, EE; <sup>4</sup>Tallinn University of Technology, EE; <sup>5</sup>University of Twente / CTIT/TDT, NL; <sup>6</sup>Hochschule Hamm-Lippstadt University of Applied Sciences, DE; <sup>7</sup>Infineon Technologies, DE

**Abstract**

This is an overview paper that motivates and describes the work performed in the European Commission funded research project BASTION, which focuses on two critical problems of modern electronics: No-Fault-Found (NFF) and CMOS ageing. New defect classes contributing to NFF have been identified, including timing related faults (TRF) at board level and intermittent resistive faults (IRF) at IC level. BASTION has addressed the mechanisms of ageing and developed several techniques to improve the longevity of electronic products. Embedded instrumentation, monitors, and the IEEE 1687 standard for reconfigurable scan networks (RSN) are seen as important levers that helped mitigate the impact of the problems listed above by facilitating a low-latency, scalable online system health monitoring and error localization infrastructure, as well as the integration of all heterogeneous technologies into a homogeneous demonstration platform. This paper gives the reader a general overview of the work performed and provides a collection of references to publications where the respective research results are described in detail.


13:05

**RETHINK BIG: EUROPEAN ROADMAP FOR HARDWARE AND NETWORKING OPTIMIZATIONS FOR BIG DATA**

Speaker: Osman Unsal, Barcelona Supercomputing Center, ES

Authors: Gina Alioto<sup>1</sup> and Paul Carpenter<sup>2</sup>

<sup>1</sup>Barcelona Supercomputing Center, ES; <sup>2</sup>BSC, ES

**Abstract**

This paper discusses the results of the RETHINK big Project, a 2-year Collaborative Support Action funded by the European Commission in order to write the European Roadmap for Hardware and Networking optimizations for Big Data. This industry-driven project was led by the Barcelona Supercomputing Center (BSC), and it included large industry partners, SMEs and academia. The roadmap identifies business opportunities from 89 in-depth interviews with 70 European industry stakeholders in the area of Big Data and predicts the future technologies that will disrupt the state of the art in Big Data processing in terms of hardware and networking optimizations. Moreover, it presents coordinated technology development recommendations (focused on optimizations in networking and hardware) that would be in the best interest of European Big Data companies to undertake in concert as a matter of competitive advantage.


13:10

**COMPUTING WITH NANO-CROSSBAR ARRAYS: LOGIC SYNTHESIS AND FAULT TOLERANCE**

Speaker: Mustafa Altun, Istanbul Technical University, TR

Authors: Mustafa Altun<sup>1</sup>, Valentina Ciriani<sup>2</sup> and Mehdi Tahoori<sup>3</sup>

<sup>1</sup>Istanbul Technical University, TR; <sup>2</sup>University of Milan, IT; <sup>3</sup>Karlsruhe Institute of Technology, DE

**Abstract**

Nano-crossbar arrays have emerged as a strong candidate technology to replace CMOS in the near future. They are regular and dense structures, and can be fabricated such that each crosspoint can be used as a conventional electronic component such as a diode, a FET, or a switch. This is a unique opportunity that allows us to integrate well-developed conventional circuit design techniques into nano-crossbar arrays. Motivated by this, our project aims to develop a complete synthesis and performance optimization methodology for switching nano-crossbar arrays that leads to the design and construction of an emerging nanocomputer. The first two work packages of the project are presented in this paper. These packages cover logic synthesis with nano-crossbar arrays under area optimization, and fault tolerance, which aims to provide a full methodology in the presence of high fault densities and extreme parametric variations in nano-crossbar architectures.


13:15

**SECURECLOUD: SECURE BIG DATA PROCESSING IN UNTRUSTED CLOUDS**

Speaker: Rafael Pires, University of Neuchâtel, CH

We present the SecureCloud EU Horizon 2020 project, whose goal is to enable new big data applications that use sensitive data in the cloud without compromising data security and privacy. For this, SecureCloud designs and develops a layered architecture that allows for (i) the secure creation and deployment of secure micro-services; (ii) the secure integration of individual micro-services to full-fledged big data applications; and (iii) the secure execution of these applications within untrusted cloud environments. To provide security guarantees, SecureCloud leverages novel security mechanisms present in recent commodity CPUs, in particular, Intel's Software Guard Extensions (SGX). SecureCloud applies this architecture to big data applications in the context of smart grids. We describe the SecureCloud approach, initial results, and considered use cases.


13:20

**WCET-AWARE PARALLELIZATION OF MODEL-BASED APPLICATIONS FOR MULTI-CORES: THE ARGO APPROACH**

Speaker: Steven Derrien, Université de Rennes 1, FR

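Authors: Steven Derrien<sup>1</sup>, Isabelle Puaut<sup>2</sup>, Panayiotis Alefragis<sup>3</sup>, Marcus Bednara<sup>4</sup>, Harald Bucher<sup>5</sup>, Clément David<sup>6</sup>, Yann Debray<sup>7</sup>, Umut Durak<sup>8</sup>, Imen Fassi<sup>9</sup>, Christian Ferdinand<sup>9</sup>, Damien Hardy<sup>2</sup>, Angeliki Kritikakou<sup>2</sup>, Gerard Rauwerda<sup>9</sup>, Simon Reder<sup>5</sup>, Martin Sicks<sup>5</sup>, Timo Stripf<sup>5</sup>, Kim Sunesen<sup>9</sup>, Timon ter Braak<sup>9</sup>, Nikolaos Voros<sup>3</sup> and Jürgen Becker<sup>5</sup>

<sup>1</sup>IRISA, FR; <sup>2</sup>University of Rennes 1 / IRISA, FR; <sup>3</sup>TGW, GR; <sup>4</sup>Fraunhofer IIS, DE; <sup>5</sup>Karlsruhe Institute of Technology, DE; <sup>6</sup>Scilab, FR; <sup>7</sup>DLR, DE; <sup>8</sup>AbsInt, FR; <sup>9</sup>Recore Systems, FR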

**Abstract**

Parallel architectures are no longer confined to the domain of high performance computing; they are also increasingly used in embedded time-critical systems. The ARGO H2020 project provides a programming paradigm and associated tool flow to exploit the full potential of such architectures in terms of development productivity, time-to-market, exploitation of the platform computing power and guaranteed real-time performance. In this paper we give an overview of the objectives of ARGO and explore the challenges introduced by our approach.

**EXPLORING THE UNKNOWN THROUGH SUCCESSIVE GENERATIONS OF LOW POWER AND LOW RESOURCE VERSATILE AGENTS**

Speaker: Martin Andraud, Eindhoven University of Technology, NL

Authors: Martin Andraud<sup>1</sup> and Marian Verhelst<sup>2</sup>

<sup>1</sup>Eindhoven University of Technology, NL; <sup>2</sup>Katholieke Universiteit Leuven, BE

**Abstract**

The Phoenix project aims to develop a new approach to exploring unknown environments, based on multiple measurement campaigns carried out by extremely tiny devices, called agents, that gather data from multiple sensors. These low power and low resource agents are configured specifically for each measurement campaign to achieve the exploration goal in the smallest number of iterations. Thus, the main design challenge is to build agents that are as reconfigurable as possible. This paper introduces the Phoenix project in more detail and presents first developments in the agent design.


13:00 End of session
Lunch Break in Garden Foyer

Keynote Lecture session 3.0 in "Garden Foyer" 13:50 - 14:20

Lunch Break in the Garden Foyer
On all conference days (Tuesday to Thursday), a buffet lunch will be offered in the Garden Foyer, in front of the session rooms. Kindly note that this is restricted to conference delegates possessing a lunch voucher only. When entering the lunch break area, delegates will be asked to present the corresponding lunch voucher of the day. Once the lunch area is being left, re-entrance is not allowed for the respective lunch.

### 2.8a Smart Medical Devices

**Date:** Tuesday 28 March 2017

**Time:** 11:30 - 12:30

**Location / Room:** Exhibition Theatre

**Organiser:** Patrick Mayor, EPFL, CH

The goal of this session is to present concrete examples of smart medical devices, such as a novel surgical robot for hearing implant surgery, a measurement module for the identification of cancer cells through elastic properties, as well as a sensing pad for non-invasive wound monitoring.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>11:30</td>
<td>2.8a.1</td>
<td>HEARRESTORE</td>
<td>Juan Ansó, UniBE, CH</td>
</tr>
<tr>
<td>11:50</td>
<td>2.8a.2</td>
<td>PATLISCI II</td>
<td>Hans Peter Lang, UniBAS, CH</td>
</tr>
<tr>
<td>12:10</td>
<td>2.8a.3</td>
<td>FLUSITEX</td>
<td>Daniel Ahmed, ETHZ, CH</td>
</tr>
<tr>
<td>12:30</td>
<td></td>
<td>End of session</td>
<td></td>
</tr>
<tr>
<td>13:00</td>
<td></td>
<td>Lunch Break in Garden Foyer</td>
<td></td>
</tr>
</tbody>
</table>

### 2.8b Smart Medical Devices, Part 2

**Date:** Tuesday 28 March 2017

**Time:** 12:30 - 13:00

**Location / Room:** Exhibition Theatre

**Organiser:** John Zhao, MathWorks, US

**MATLAB AND SIMULINK IN THE SMART DEVICES AND BIG DATA ERA**

Speaker: Stefano Olivieri, MathWorks Academia Group, US

**Abstract**

Smart connected devices and Internet of Things (IoT) are emerging technologies that are impacting diverse industries, including automotive, energy, healthcare, retail, smart manufacturing, smart buildings and homes, smart transportation, etc. Combining internet-connected devices with cloud computing, machine learning, and other data analytics approaches is enabling products and solutions that are transforming the way we live and work. For example, Smart Medical Devices are key components of new products and solutions that may help healthcare professionals to improve health outcomes from anywhere, leading to increased value for the patient.

However, a system developer working on such products and services faces challenges in capturing, storing, and analyzing the Big Data generated from a multitude of devices. Also, integrating Smart Devices, IoT and Big Data raises specific challenges for data acquisition, reduction, and transmission, using increasingly sophisticated technologies such as RFID tags, Wireless Sensor Nodes and mobile devices.

Using the development of a Smart Medical Device based on a healthcare application as an example, this presentation will discuss how engineers and scientists creating smart devices and IoT systems use MATLAB and Simulink to access and analyze huge data sets from devices, sensors, and databases; apply deep learning and other machine-learning techniques to develop predictive models; and design and test smart devices that wirelessly interact with cloud services like ThingSpeak™, an analytic IoT platform that can run MATLAB code on demand in the cloud.

13:00 End of session
Lunch Break in Garden Foyer

### UB02 Session 2

**Date:** Tuesday 28 March 2017

**Time:** 12:30 - 15:00

**Location / Room:** Booth 1, Exhibition Area

<table>
<thead>
<tr>
<th>Label</th>
<th>Presentation Title</th>
</tr>
</thead>
<tbody>
<tr>
<td>UB02.1</td>
<td>WORKCRAFT: TOOLSET FOR FORMAL SPECIFICATION, SYNTHESIS AND VERIFICATION OF CONCURRENT SYSTEMS</td>
</tr>
<tr>
<td>Presenter</td>
<td>Danil Sokolov, Newcastle University, GB</td>
</tr>
<tr>
<td>UB02.2</td>
<td>WE DARE: WEARABLE ELECTRONICS DIRECTIONAL AUGMENTED REALITY</td>
</tr>
<tr>
<td>Presenter</td>
<td>Davide Quaglia, University of Verona, IT</td>
</tr>
<tr>
<td>Authors</td>
<td>Gianluca Benedetti<sup>1</sup> and Walter Vendramineto<sup>2</sup></td>
</tr>
<tr>
<td></td>
<td>EDALab srl, IT</td>
</tr>
<tr>
<td>UB02.3</td>
<td>TTOOL5G: MODEL-BASED DESIGN OF A 5G UPLINK DATA-LINK LAYER RECEIVER FROM UML/SYSML DIAGRAMS</td>
</tr>
<tr>
<td>Presenter</td>
<td>Andrea Enrici, Nokia Bell Labs France, FR</td>
</tr>
<tr>
<td>Authors</td>
<td>Julia Lallet<sup>1</sup>, Imran Latif<sup>1</sup>, Ludovic Apvrille<sup>2</sup>, Renaud Pacalet<sup>2</sup> and Adrien Canuel<sup>2</sup></td>
</tr>
</tbody>
</table>

**A voltage-scalable fully digital on-chip memory for ultra-low-power IoT processors**

Presenter: Jun Shiomi, Kyoto University, JP
Authors: Tohru Ishihara and Hidetoshi Onodera, Kyoto University, JP

Abstract:
A voltage-scalable RISC processor integrating standard-cell based memory (SCM) is demonstrated. Unlike conventional processors, it uses SCMs as an alternative to conventional SRAM macros, enabling it to operate at a single supply voltage of 0.4 V. The processor is implemented with a fully automated cell-based design flow, which leads to low design costs. By scaling the supply voltage and applying back-gate biasing techniques, the power dissipation of the SCMs is kept below 20 µW, enabling the SCMs to operate from an ambient energy source alone. In this demonstration, the SCMs of the processor operate with a lemon battery as the ambient energy source.

**LABSMILING: A FRAMEWORK, COMPOSED OF A REMOTELY ACCESSIBLE TESTBED AND RELATED SW TOOLS, FOR ANALYSIS AND DESIGN OF LOW DATA-RATE WIRELESS PERSONAL AREA NETWORKS BASED ON IEEE 802.15.4**

Presenter: Marco Santic, University of L'Aquila, IT

Authors: Luigi Pomante, Walter Tiberti, Carlo Centofanti and Lorenzo Di Giuseppe, DEWS - Università di L'Aquila, IT

Abstract
Low data-rate wireless personal area networks (LR-WPANs) are increasingly present in the fields of IoT, wearable devices and health monitoring. The development, deployment and testing of such systems, based on the IEEE 802.15.4 standard (and its derivations, e.g. 15.4e), require a testbed once the network is no longer trivial and grows in complexity. This demo shows the LabSmiling framework: a testbed and related SW tools that connect a meaningful (but still scalable) number of physical devices (sensor nodes) located in a real environment. It offers the following services: program, reset, and switch on/off single devices; connect to device up/down links to inject or receive commands/msgs/packets into/from the network; and set devices as low-level packet sniffers, allowing protocol compliance or extensions to be tested and debugged. Advanced services include the ability to design test scenarios for the evaluation of network metrics (throughput, latencies, etc.) and custom application verification.


15:00 End of session
16:00 Coffee Break in Exhibition Area
On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Tuesday, March 28, 2017
- Coffee Break 10:30 - 11:30
- Coffee Break 16:00 - 17:00

Wednesday, March 29, 2017
- Coffee Break 10:00 - 11:00
- Coffee Break 16:00 - 17:00

Thursday, March 30, 2017
- Coffee Break 10:00 - 11:00
- Coffee Break 15:30 - 16:00

### 3.0 LUNCH TIME KEYNOTE SESSION: Precision Medicine: Where Engineering and Life Science meet

**Date:** Tuesday 28 March 2017

**Time:** 13:50 - 14:20

**Location / Room:** Garden Foyer

**Chair:** David Atienza, EPFL, CH

As we witness the relentless growth of computing power, storage capacity and communication bandwidth, we also see a major trend in bio-medical sciences to become more quantitative and amenable to benefit from the support of electronic systems. Moreover, societal and economic needs push us to develop and adopt health-management approaches that are more effective, less expensive and flexible enough to be personalized to individual and community needs. Within this frame, precision medicine promises to better society by applying engineering technology to personalized health, with devices that are in/on the body and ubiquitously connected. Examples from the Swiss-wide Nano-Tera.ch program will show various techniques related to remote patient monitoring, emergency care as well as routine care. These examples show the advantages that stem from organized and optimized means to quantify clinical data, handle large data sets as well as controlling and personalizing therapy and drug administration.

Abstract
As we witness the relentless growth of computing power, storage capacity and communication bandwidth, we also see a major trend in bio-medical sciences to become more quantitative and amenable to benefit from the support of electronic systems. Moreover, societal and economic needs push us to develop and adopt health-management approaches that are more effective, less expensive and flexible enough to be personalized to individual and community needs. Within this frame, precision medicine promises to better society by applying engineering technology to personalized health, with devices that are in/on the body and ubiquitously connected. Examples from the Swiss-wide Nano-Tera.ch program will show various techniques related to remote patient monitoring, emergency care as well as routine care. These examples show the advantages that stem from organized and optimized means to quantify clinical data, handle large data sets as well as controlling and personalizing therapy and drug administration. Electronic design automation is a key technology to realize systems for precision medicine. Examples of specific EDA tools and methods encompass physical design of integrated sensors and their coupling to electronics, simulation of complex systems with bio-chemical stimuli, synthesis of decision making circuitry based on plurality of inexact inputs, policies design for therapies exploiting online data acquisition, and verification of life-critical applications under broad-ranging and unpredictable input conditions. Overall, precision medicine represents an important and large market opportunity. EDA is a necessary underlying technology to realize the promises of better and less expensive care for everyone.

14:20 End of session
3.1 IT&A Session: Parallel Ultra-Low-Power Computing for the IoT: Applications, Platforms, Circuits

Date: Tuesday 28 March 2017
Time: 14:30 - 16:00
Location / Room: SBC

Organisers:
Luca Benini, ETHZ, CH
Davide Rossi, Università di Bologna, IT

Chair:
Luca Benini, ETHZ, CH

Co-Chair:
Davide Rossi, Università di Bologna, IT

This special session will give a deep dive into ultra-low-power (ULP) computing for Internet-of-Things applications, starting from leading-edge MCU-based commercial solutions, moving to next-generation highly parallel ULP architectures based on open-source hardware and software, and fast-forwarding to advanced research solutions based on new models of computation.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>14:30</td>
<td>3.1.1</td>
<td>BETTER THAN WORST CASE SIGNOFF STRATEGIES FOR LOW POWER IOT DEVICES</td>
<td>Jose Pineda de Gyvez, NXP Semiconductors, NL</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Portable consumer electronic devices are nowadays ubiquitous. Digital ubiquity, along with a lift in semiconductor utilization for consumer electronics, power autonomy, and device miniaturization are key challenges to attain digital convergence for seamless operability. Most of the state-of-the-art computing architectures are based on power-performance trade-offs. In fact, it is inconceivable that any kind of competitive compute solution can be marketed without power management in the entire application field. The relatively slow progress in battery technologies demands radical innovations for energy-efficient operation. The inability of battery technologies to keep pace with the long operating times required by modern multi-purpose devices necessitates alternative (design) solutions that extend battery lifetime. In this presentation we will focus on signoff techniques aimed at yielding designs with smaller area and lower power, while also reducing the signoff complexity caused by severe process variability. More specifically, we make use of standard cell libraries characterized for a lower process spread (e.g. -1σ corner), tighter voltage margin (e.g. Vdd-5%) and typical operating temperature instead of targeting the worst-case PVT corner (e.g. -3σ corner, Vdd-10%, 125°C). We evaluate the proposed techniques in a Cortex-M3 testchip designed in a 40nm CMOS process. We will show measurement results that demonstrate the effectiveness of using better-than-worst-case signoffs.</td>
<td></td>
</tr>
<tr>
<td>15:00</td>
<td>3.1.2</td>
<td>GAP: AN OPEN-SOURCE PULP-RISCV PLATFORM FOR NEAR-SENSOR ANALYTICS</td>
<td>Eric Flamand, GreenWaves Technologies, FR</td>
</tr>
<tr>
<td>15:30</td>
<td>3.1.3</td>
<td>ENERGY-QUALITY SCALABLE ADAPTIVE VLSI CIRCUITS AND SYSTEMS BEYOND APPROXIMATE COMPUTING</td>
<td>Massimo Alioto, National University of Singapore, SG</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>In this paper, the concept of energy-quality (EQ) scalable systems is introduced and explored as a novel design dimension to scale down energy in integrated systems for the Internet of Things (IoT). EQ-scalable systems explicitly trade off energy and quality at different levels of abstraction (“vertically”) and across sub-systems (“horizontally”), creating new opportunities to improve energy efficiency for a given task and expected “quality”. The concept of quality slack, a taxonomy of techniques to trade off energy and quality, and a general EQ-scalable architecture are presented. The generality of the EQ-scaling concept is shown through several examples, ranging from logic to analog circuits, to memories and analog-to-digital converters. Challenges, opportunities and expected energy gains are discussed to gain an understanding of the potential of EQ-scalable integrated circuits and systems. As a result, EQ-scalable systems are expected to substantially improve the energy efficiency of systems for IoT, compensating for the limited energy gains that will be offered by technology and voltage scaling. Download Paper (PDF; Only available from the DATE venue WiFi)</td>
<td></td>
</tr>
</tbody>
</table>
### 3.2 Hot Topic Session: New Benchmarking Vectors for Emerging Devices, Circuits, and Architectures: Energy, Delay, and … Accuracy

**Date:** Tuesday 28 March 2017  
**Time:** 14:30 - 16:00  
**Location / Room:** 4BC

**Organisers:**  
Xiaobo Sharon Hu, University of Notre Dame, US  
Michael Niemier, University of Notre Dame, US

**Chair:**  
Xiaobo Sharon Hu, University of Notre Dame, US  
**Co-Chair:**  
Pierre-Emmanuel Gaillardon, The University of Utah at Salt Lake City, US

There is ever-growing interest in alternative computational models (e.g., neural networks, etc.), as well as how emerging technologies can best be exploited to address application-level needs. This hot topic session addresses the above issues from the perspective of benchmarking. It considers the impact of emerging devices, circuits, and architectures at the application level in the context of new metrics and benchmarking methodologies being developed via the Semiconductor Research Corporation (SRC). Subsequent presentations highlight benchmarking and design space exploration efforts that consider application-level energy and performance in the context of computational accuracy. They also highlight infrastructure that can be used to compare different devices, circuits, and architectures that ultimately address the same information processing task.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>14:30</td>
<td>3.2.1</td>
<td><strong>BEYOND-CMOS NON-BOOLEAN LOGIC BENCHMARKING: INSIGHTS AND FUTURE DIRECTIONS</strong></td>
<td>Azad Naeemi, Georgia Institute of Technology, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Authors:</strong> Chenyun Pan and Azad Naeemi, Georgia Institute of Technology, US</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>Emerging technologies are facing significant challenges to compete with CMOS with respect to Boolean logic. There is an increasing need for using non-traditional circuits to realize the full potential of beyond-CMOS devices. This paper presents a uniform benchmarking methodology for non-Boolean computation based on the cellular neural network (CNN) for a variety of beyond-CMOS device technologies, including charge-based and spintronic devices. Three types of CNN implementations are benchmarked for a given input noise and recall accuracy target using analog, digital, and spintronic circuits. Results demonstrate that spintronic devices are promising candidates to implement CNNs, where up to 3× EDP improvement is predicted in domain wall devices compared to its conventional CMOS counterpart. This shows that alternative non-Boolean computing platforms are crucial for developing future emerging technologies. Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
<tr>
<td>15:00</td>
<td>3.2.2</td>
<td><strong>UNDERSTANDING THE DESIGN OF IBM NEUROSYNAPTIC SYSTEM AND ITS TRADEOFFS: A USER PERSPECTIVE</strong></td>
<td>Yiran Chen, Duke University, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Authors:</strong> Hsin-Pai Cheng, Wei Wen, Chunpeng Wu, Sicheng Li, Hai (Helen) Li and Yiran Chen, University of Pittsburgh, US</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>As a large-scale commercial spiking-based neuromorphic computing platform, the IBM TrueNorth processor has received tremendous attention. However, one of the known issues in the TrueNorth design is the limited precision of synaptic weights. The current workaround is to run multiple neural network copies in which the average value of each synaptic weight is close to that in the original network. We theoretically analyze the impacts of low data precision in the TrueNorth chip on inference accuracy, core occupation, and performance, and present a probability-biased learning method to enhance the inference accuracy by reducing the random variance of each computation copy. Our experimental results show that the proposed techniques considerably improve the computation accuracy of the TrueNorth platform and reduce the incurred hardware and performance overheads. Among all the tested methods, L1TEA regularization achieved the best result: up to 2.74% accuracy enhancement when deploying an MNIST application on the TrueNorth platform. In May 2016, the IBM TrueNorth team implemented convolutional neural networks (CNNs) on the TrueNorth processor and coincidentally used a similar method, namely trinary weights {-1, 0, 1}. It achieves near state-of-the-art accuracy on 8 standard datasets. In addition, to further evaluate TrueNorth performance on CNNs, we test similar deep convolutional networks on TrueNorth, GPU and FPGA. Among all, the GPU has the highest throughput. But if we consider energy consumption, the TrueNorth processor is the most energy efficient one, at &gt; 6000 frames/sec/Watt. Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
</tbody>
</table>
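The multi-copy workaround mentioned in the abstract of 3.2.2 can be illustrated with a small sketch. It is not taken from the paper: the sampling rule (`sample_ternary`), the weight values and the number of copies are assumptions chosen so that the expected value of each ternary weight equals the original weight.

```python
import random

def sample_ternary(w, rng):
    """Draw one weight in {-1, 0, 1} whose expected value equals w (requires |w| <= 1).

    Hypothetical sampling rule: emit sign(w) with probability |w|, otherwise 0.
    """
    if rng.random() < abs(w):
        return 1 if w > 0 else -1
    return 0

def averaged_copies(weights, n_copies, seed=0):
    """Average n_copies independently sampled ternary copies of each weight."""
    rng = random.Random(seed)
    sums = [0] * len(weights)
    for _ in range(n_copies):
        for i, w in enumerate(weights):
            sums[i] += sample_ternary(w, rng)
    return [s / n_copies for s in sums]

# Illustrative full-precision weights (assumed values, not from the paper).
original = [0.8, -0.3, 0.05, 0.0, -1.0]
approx = averaged_copies(original, n_copies=4000)
max_err = max(abs(a - w) for a, w in zip(approx, original))
print(approx, max_err)
```

The residual error shrinks only with the square root of the number of copies, which is one way to read the abstract's motivation for reducing the random variance of each copy rather than adding more copies.
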
15:30  3.2.3  CELLULAR NEURAL NETWORK FRIENDLY CONVOLUTIONAL NEURAL NETWORKS - CNNS WITH CNNS

Speaker: Michael Niemier, University of Notre Dame, US
Authors: András Horváth¹, Michael Hillmer², Qiwen Lou², X. Sharon Hu² and Michael Niemier²

¹Pázmány Péter Catholic University, HU; ²University of Notre Dame, US

Abstract
This paper will discuss the development and evaluation of a cellular neural network (CeNN)-friendly deep learning network that addresses the MNIST digit recognition problem. Prior work has shown that CeNNs leveraging emerging technologies such as tunnel transistors can improve energy or EDP of CeNNs, while simultaneously offering richer/more complex functionality. Important questions to address are what applications can benefit from CeNNs, and whether CeNNs can eventually outperform other alternatives at the application-level in terms of energy, performance, and accuracy. This paper begins to address these questions by using the MNIST problem as a case study.

Download Paper (PDF; Only available from the DATE venue WiFi)

16:00 End of session

3.3 Hardware Trojans and Fault Attacks

Date: Tuesday 28 March 2017
Time: 14:30 - 16:00
Location / Room: 2BC

Chair: Ilia Polian, University of Passau, DE
Co-Chair: Matthias Sauer, University of Freiburg, DE

This session focuses on two types of active attacks on system hardware modules: hardware Trojans (malicious modifications) and fault injections into cryptographic modules. The papers cover Trojans that target coherence protocols in memory caches; Trojan detection based on measurement of path delays; detection of malware using machine learning; and fault attacks on the cryptographic hash function SHA-3.

14:30  3.3.1  ALGEBRAIC FAULT ANALYSIS OF SHA-3
Speaker: Pei Luo, Northeastern University, US
Authors: Pei Luo, Konstantinos Athanasiou, Yunsi Fei and Thomas Wahl, Northeastern University, US

Abstract
This paper presents an efficient algebraic fault analysis on all four modes of SHA-3 under relaxed fault models. This is the first work to apply algebraic techniques to fault analysis of SHA-3. Results show that algebraic fault analysis on SHA-3 is very efficient and effective due to the clear algebraic properties of Keccak operations. Compared with previous work on differential fault analysis of SHA-3, algebraic fault analysis can identify the injected faults at much higher rates and recover the entire internal state of the penultimate round with far fewer fault injections.

Download Paper (PDF; Only available from the DATE venue WiFi)

15:00  3.3.2  EVALUATING COHERENCE-EXPLOITING HARDWARE TROJAN
Speaker: Minsu Kim, Korea University, KR
Authors: Minsu Kim¹, Sunhee Kong¹, Boeui Hong¹, Lei Xu², Weidong Shi² and Taeweon Suh¹

¹Korea University, KR; ²University of Houston, US

Abstract
The increasing complexity of integrated circuits and IP-based hardware designs has created the risk of hardware Trojans. This paper introduces a new type of threat, a coherence-exploiting hardware Trojan. This Trojan can be maliciously implanted in master components in a system, and continuously injects memory transactions onto the main interconnect. The injected traffic forces the eviction of cache lines, taking advantage of cache coherence protocols. This type of Trojan insidiously slows down the system, effectively mounting a Denial-of-Service (DoS) attack. We used a Xilinx Zynq-7000 device to implement the Trojan and evaluate its severity. Experiments revealed that system performance can be degraded by as much as 258% with the Trojan. A countermeasure to neutralize the Trojan attack is proposed in detail. We also found that AXI version 3.0 supports a seemingly irrelevant invalidation protocol through ACP, opening a door for the potential Trojan attack.

Download Paper (PDF; Only available from the DATE venue WiFi)

Runtime hardware Trojan (HT) detection methods based on side-channel analysis suffer severely from process variations. To suppress the effect of these variations, we devise a method that selects two highly correlated paths for each interconnect (edge) that is suspected to have an HT on it. The first path is the shortest one passing through the suspected edge, and the second is a path that is highly correlated with the first. The delay ratio of these paths enables the detection of HT-inserted circuits. Test results reveal that the method enables the detection of even minimally invasive Trojans in spite of both inter- and intra-die variations with spatial correlations.

Download Paper (PDF; Only available from the DATE venue WiFi)
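The reason a delay ratio of two correlated paths is less sensitive to process variation than a raw delay can be sketched numerically. This is only an illustration under strong assumptions, not the authors' method: both paths are assumed to share one multiplicative inter-die factor, local mismatch is bounded at ±1%, and all delays, the Trojan load and the tolerance are hypothetical values.

```python
import random

NOMINAL_A = 400.0                 # ps, hypothetical path through the suspected edge
NOMINAL_B = 500.0                 # ps, hypothetical correlated reference path
TROJAN_EXTRA = 0.08 * NOMINAL_A   # hypothetical extra delay caused by a Trojan load
TOLERANCE = 0.03                  # allowed relative deviation of the delay ratio

def measure(nominal_ps, global_factor, rng, local_spread=0.01, extra_ps=0.0):
    """Delay = shared inter-die factor x (nominal delay with bounded local noise + extra)."""
    local = 1.0 + rng.uniform(-local_spread, local_spread)
    return global_factor * (nominal_ps * local + extra_ps)

def suspicious(delay_a, delay_b):
    """Flag the die when the measured ratio departs from the nominal ratio."""
    ratio = delay_a / delay_b
    expected = NOMINAL_A / NOMINAL_B
    return abs(ratio / expected - 1.0) > TOLERANCE

rng = random.Random(1)
clean_flags, trojan_flags = [], []
for _ in range(1000):
    g = rng.uniform(0.8, 1.2)     # +/-20% inter-die variation, shared by both paths
    ref = measure(NOMINAL_B, g, rng)
    clean_flags.append(suspicious(measure(NOMINAL_A, g, rng), ref))
    trojan_flags.append(
        suspicious(measure(NOMINAL_A, g, rng, extra_ps=TROJAN_EXTRA), ref))

print(any(clean_flags), all(trojan_flags))
```

In this model the shared factor cancels in the ratio, so an 8% Trojan-induced shift stands out even though the raw delay varies by ±20% from die to die. Real paths are only partially correlated, which is why the choice of the second path matters.
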

MALWARE DETECTION USING MACHINE LEARNING BASED ANALYSIS OF VIRTUAL MEMORY ACCESS PATTERNS

Abstract
Malicious software, referred to as malware, continues to grow in sophistication. Past proposals for malware detection have primarily focused on software-based detectors, which are vulnerable to being compromised. Thus, recent work has proposed hardware-assisted malware detection. In this paper, we introduce a new framework for hardware-assisted malware detection based on monitoring and classifying memory access patterns using machine learning. This provides increased automation and coverage by reducing user input on specific malware signatures. The key insight underlying our work is that malware must change control flow and/or data structures, which leaves fingerprints on program memory accesses. Building on this, we propose an online framework for detecting malware that uses machine learning to classify malicious behavior based on virtual memory access patterns. Novel aspects of the framework include techniques for collecting and summarizing per-function/system-call memory access patterns, and a two-level classification architecture. Our experimental evaluation focuses on two important classes of malware: (i) kernel rootkits and (ii) memory corruption attacks on user programs. The framework has a detection rate of 99.0% with less than 5% false positives and outperforms previous proposals for hardware-assisted malware detection.

Download Paper (PDF; Only available from the DATE venue WiFi)

POWER PROFILING OF MICROCONTROLLER’S INSTRUCTION SET FOR RUNTIME HARDWARE TROJANS DETECTION WITHOUT GOLDEN CIRCUIT MODELS

Abstract
Globalization trends in integrated circuit (IC) design are leading to increased vulnerability of ICs to hardware Trojans (HTs). Recently, several techniques based on side-channel parameters have been developed to detect these hardware Trojans, but they require a golden circuit as a reference model; due to the widespread usage of IPs, most systems-on-chip (SoCs) do not have a golden reference. Hardware Trojans in intellectual property (IP)-based SoC designs are considered a major concern for future integrated circuits. Most state-of-the-art runtime hardware Trojan detection techniques presume that Trojans will lead to anomalies in the SoC integration units. In this paper, we argue that an intelligent intruder may intrude into the IP-based SoC without disturbing the normal SoC operation or violating any protocols. To overcome this limitation, we propose a methodology to extract the power profile of the microcontroller's instruction set, which is in turn used to train a machine learning algorithm. In this technique, the power profile is obtained by extracting the power behavior of the microcontroller for different assembly language instructions. The trained model is then embedded into the integrated circuit at the SoC integration level, where it classifies the power profile during runtime to detect intrusions. Applied to MC8051 Trojan benchmarks, the proposed technique achieves 87.1% to 99.1% accuracy. To the best of our knowledge, this is the first work in which the power profile of a microcontroller’s instruction set is used in conjunction with machine learning for runtime HT detection.

Download Paper (PDF; Only available from the DATE venue WiFi)

3.4 Guardbanding and Approximation

Date: Tuesday 28 March 2017
Time: 14:30 - 16:00
Location / Room: 3A
Chair: Michael Glass, Ulm University, DE
Co-Chair: Yuko Hara-Azumi, Tokyo Institute of Technology, JP

This session starts with a guardbanding-based approach that uses cell libraries designed and classified for different temperature ranges for improving circuit timing as well as …

Download Paper (PDF; Only available from the DATE venue WiFi)

Approximate computing is gaining more and more attention as a potential solution to the problem of increasing energy demand in computing. Several recent works focus on the application of deterministic approximate computing to arithmetic computations. Circuits for addition and multiplication are simplified, trading exactness for energy and/or speed. Recent approximation techniques for adders focus on modifications of individual full adders’ truth tables or shortening carry chains. While the resulting error is usually characterized with statistical measures over the range of possible input/output combinations, the actual adder is a static nonlinear system regarding arithmetic operations and signal processing. The resulting unexpected effects present a challenge for adopting approximate computing as a widespread and standard application-level optimization technique. This paper focuses on the deterministic effects of approximate multi-bit adders, which are especially evident for certain input data in otherwise well-specified systems, showing the necessity to look beyond purely statistical measures. We show which fundamental principles are violated depending on the chosen approximation scheme, and how this choice affects practical applications. This can serve as a basis for designers to make informed decisions about the use of approximate adders at the application level.

GAUSSIAN MIXTURE ERROR ESTIMATION FOR APPROXIMATE CIRCUITS

Speaker: Amin Ghasemazar, The University of British Columbia, CA
Authors: Amin Ghasemazar and Mieszko Lis, University of British Columbia, CA

Abstract
In application domains where perceived quality is limited by human senses, where data are inherently noisy, or where models are naturally inexact, approximate computing offers an attractive tradeoff between accuracy and energy or performance. While several approximate functional units have been proposed to date, the question of how these techniques can be systematically integrated into a design flow remains open. Ideally, units like adders or multipliers could be automatically replaced with their approximate counterparts as part of the design flow. This, however, requires accurately modelling approximation errors to avoid compromising output quality. Prior proposals have either focused on describing errors per-bit or significantly limited estimation accuracy to reduce otherwise exponential storage requirements. When multiple approximate modules are chained, these limitations become critical, and propagated error estimates can be orders of magnitude off. In this paper, we propose an approach where both input distributions and approximation errors are modelled as Gaussian mixtures. This naturally represents the multiple sources of error that arise in many approximate circuits while maintaining reasonable memory requirements. Estimation accuracy is significantly better than prior art (up to 7.2× lower Hellinger distance) and errors can be accurately propagated through a cascade of approximate operations; estimates of quality metrics like MSE and MED are within a few percent of simulation-derived values.

Download Paper (PDF; Only available from the DATE venue WiFi)
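To make the Gaussian-mixture idea in the abstract above concrete, the following sketch propagates a mixture-modelled error through a single addition. It is an illustration, not the authors' estimator: it assumes the signal and the error are independent, and the component weights, means and variances are invented for the example.

```python
from itertools import product

def add_mixtures(x, y):
    """Distribution of X + Y for independent Gaussian mixtures.

    Each mixture is a list of (weight, mean, variance) components. Every pair of
    components yields one output component: weights multiply, means and variances add.
    """
    return [(wx * wy, mx + my, vx + vy)
            for (wx, mx, vx), (wy, my, vy) in product(x, y)]

def mean(mixture):
    """Mean of a Gaussian mixture."""
    return sum(w * mu for w, mu, _ in mixture)

def variance(mixture):
    """Variance of a Gaussian mixture (second moment minus squared mean)."""
    second_moment = sum(w * (v + mu * mu) for w, mu, v in mixture)
    return second_moment - mean(mixture) ** 2

# Hypothetical exact result of an operation: a single Gaussian.
signal = [(1.0, 10.0, 4.0)]
# Hypothetical error of an approximate unit: usually tiny, occasionally a large
# negative offset (for example from a dropped carry).
error = [(0.9, 0.0, 0.01), (0.1, -8.0, 1.0)]

out = add_mixtures(signal, error)
print(out, mean(out), variance(out))
```

Chaining several approximate operations multiplies the number of components, so a practical flow would also need to merge or prune components to keep memory bounded; that step is omitted here.
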

ENHANCING SYMBOLIC SYSTEM SYNTHESIS THROUGH ASPMT WITH PARTIAL ASSIGNMENT EVALUATION

Speaker: Kai Neubauer, University of Rostock, DE
Authors: Kai Neubauer¹, Philipp Wanko², Torsten Schaub² and Christian Haubelt¹
¹University of Rostock, DE; ²University of Potsdam, DE

Abstract
The design of embedded systems is becoming continuously more complex such that efficient design methods are becoming crucial for competitive results regarding design time and performance. Recently, combined Answer Set Programming (ASP) and Quantifier Free Integer Difference Logic (QF-IDL) solving has been shown to be a promising approach in system synthesis. However, this approach still has several restrictions limiting its applicability. In the paper at hand, we propose a novel ASP modulo theories (ASPmT) system synthesis approach, which (i) supports more sophisticated system models, (ii) tightly integrates the QF-IDL solving into the ASP solving, and (iii) makes use of partial assignment checking. As a result, more realistic systems are considered and an early exclusion of infeasible solutions improves the entire system synthesis.

Download Paper (PDF; Only available from the DATE venue WiFi)

**APPROXIMATE COMPUTING FOR SPIKING NEURAL NETWORKS**

**Speaker:** Sanchari Sen, Purdue University, US

**Authors:** Sanchari Sen, Swaghat Venkataramani and Anand Raghunathan, Purdue University, US

**Abstract**

Spiking Neural Networks (SNNs) are widely regarded as the third generation of artificial neural networks, and are expected to drive new classes of recognition, data analytics and computer vision applications. However, large-scale SNNs (e.g., of the scale of the human visual cortex) are highly compute and data intensive, requiring new approaches to improve their efficiency. Complementary to prior efforts that focus on parallel software and the design of specialized hardware, we propose AxSNN, the first effort to apply approximate computing to improve the computational efficiency of evaluating SNNs. In SNNs, the inputs and outputs of neurons are encoded as a time series of spikes. A spike at a neuron’s output triggers updates to the potentials (internal states) of neurons to which it is connected. AxSNN determines spike-triggered neuron updates that can be skipped with little or no impact on output quality and selectively skips them to improve both compute and memory energy. Neurons that can be approximated are identified by utilizing various static and dynamic parameters such as the average spiking rates and current potentials of neurons, and the weights of synaptic connections. Such a neuron is placed into one of many approximation modes, wherein the neuron is sensitive only to a subset of its inputs and sends spikes only to a subset of its outputs. A controller periodically updates the approximation modes of neurons in the network to achieve energy savings with minimal loss in quality. We apply AxSNN to both hardware and software implementations of SNNs. For hardware evaluation, we designed SNNAP, a Spiking Neural Network Approximate Processor that embodies the proposed approximation strategy, and synthesized it in 45nm technology. The software implementation of AxSNN was evaluated on a 2.7 GHz Intel Xeon server with 128 GB memory. 
Across a suite of 6 image recognition benchmarks, AxSNN achieves 1.4-5.5X reduction in scalar operations for network evaluation, which translates to 1.2-3.6X and 1.26-3.9X improvement in hardware and software energies respectively, for no loss in application quality. Progressively higher energy savings are achieved with modest reductions in output quality.

Download Paper (PDF; Only available from the DATE venue WiFi)
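The idea of skipping spike-triggered updates that barely affect the output can be shown on a toy network. The sketch below is a simplification, not AxSNN itself: it uses a single static criterion (synapse weight magnitude) instead of the paper's combination of static and dynamic parameters, and the function `run`, the weights, the spike sequence and the thresholds are all assumed for illustration.

```python
def run(weights, input_spikes, threshold=1.0, skip_below=0.0):
    """Integrate input spikes into output-neuron potentials.

    weights[i][j] is the synapse from input neuron i to output neuron j.
    Updates with |weight| < skip_below are skipped (the approximation).
    Returns (set of output neurons that reached threshold, updates performed).
    """
    n_out = len(weights[0])
    potential = [0.0] * n_out
    updates = 0
    for i in input_spikes:              # one entry per incoming spike
        for j in range(n_out):
            w = weights[i][j]
            if abs(w) < skip_below:
                continue                # skipped spike-triggered update
            potential[j] += w
            updates += 1
    fired = {j for j, p in enumerate(potential) if p >= threshold}
    return fired, updates

# Hypothetical 3-input, 3-output layer and spike sequence.
W = [[0.6, 0.02, -0.01],
     [0.5, 0.03, 0.7],
     [0.01, -0.02, 0.4]]
spikes = [0, 1, 2, 1]

exact = run(W, spikes)
approx = run(W, spikes, skip_below=0.05)
print(exact)    # ({0, 2}, 12)
print(approx)   # ({0, 2}, 6)
```

In this example half of the scalar updates are skipped while the same output neurons fire. With a more aggressive `skip_below` the outputs would eventually change, which corresponds to the energy versus quality trade-off described in the abstract.
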

**REAL-TIME ANOMALY DETECTION FOR STREAMING DATA USING BURST CODE ON A NEUROSYNAPTIC PROCESSOR**

**Speaker:** Qinru Qiu, Syracuse University, US

**Authors:** Qinru Qiu and Jiwei Chen, Syracuse University, US

**Abstract**

Real-time anomaly detection for streaming data is a desirable feature for mobile devices or unmanned systems. The key challenge is how to deliver the required performance under a stringent power constraint. To address the conflict between performance and power consumption, brain-inspired hardware, such as the IBM Neurosynaptic System, has been developed to enable low-power implementation of large-scale neural models. Meanwhile, inspired by the operation and the massively parallel structure of the human brain, a corelet library, NeoInfer-TN, is developed for basic operations in the architecture. Instead of the traditional rate code, burst code is adopted in the design, which represents a numerical value using the phase of a burst of spike trains. This not only reduces the hardware complexity but also increases the accuracy of the results. The design can be configured for different tradeoffs between detection accuracy and throughput/energy. We evaluate the system using intrusion detection data streams. The results show a higher detection rate than some conventional approaches, and real-time performance with only 50mW power consumption. Overall, it achieves 10^8 operations per watt-second.

Download Paper (PDF; Only available from the DATE venue WiFi)

**FAST, LOW POWER EVALUATION OF ELEMENTARY FUNCTIONS USING RADIAL BASIS FUNCTION NETWORKS**

**Speaker:** Parami Wijesinghe, Purdue University, US

**Authors:** Parami Wijesinghe, Chamika Liyanagedera and Kaushik Roy, Purdue University, US

**Abstract**

Fast and efficient implementation of elementary functions such as sin(), cos(), and log() is of great importance in a large class of applications. State-of-the-art methods for function evaluation involve either expensive calculations such as multiplications, a large number of iterations, or large lookup tables (LUTs). A higher number of iterations leads to higher latency, whereas large LUTs contribute to delay, higher area requirements and higher power consumption owing to data fetching and leakage. We propose a hardware architecture for evaluating mathematical functions, consisting of a small LUT and a simple Radial Basis Function Network (RBFN), a type of Artificial Neural Network (ANN). Our proposed method evaluates trigonometric, hyperbolic, exponential, logarithmic, and square root functions. This technique finds utility in applications where the highest priority is on performance and power consumption. In contrast to traditional ANNs, our approach does not involve multiplication when determining the post-synaptic states of the network. Owing to the simplicity of the approach, we were able to attain more than 2.5x power benefits and more than 1.4x performance benefits when compared with traditional approaches, under the same accuracy conditions.

Download Paper (PDF; Only available from the DATE venue WiFi)
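As a rough software illustration of evaluating an elementary function with a small table and a radial basis function network, the sketch below approximates sin(x) on [0, π/2] with a normalized Gaussian RBF network. It is not the architecture proposed in the paper: it uses ordinary multiplications, and the kernel width, centre spacing and the choice of storing sin(c) at each centre as the output weight are assumptions made for this example.

```python
import math

SIGMA = 0.05   # assumed Gaussian kernel width
STEP = 0.05    # assumed spacing between RBF centres

# Centres extend 0.5 beyond both ends of [0, pi/2] so the edges are not biased.
CENTRES = [-0.5 + k * STEP for k in range(int((math.pi / 2 + 1.0) / STEP) + 2)]
# Small lookup table: one output weight per centre.
LUT = [math.sin(c) for c in CENTRES]

def rbfn_sin(x):
    """Normalized RBF network: activation-weighted average of the table entries."""
    acts = [math.exp(-((x - c) ** 2) / (2 * SIGMA ** 2)) for c in CENTRES]
    return sum(a * w for a, w in zip(acts, LUT)) / sum(acts)

# Worst-case error over 101 evenly spaced points in [0, pi/2].
grid = [k * (math.pi / 2) / 100 for k in range(101)]
max_err = max(abs(rbfn_sin(x) - math.sin(x)) for x in grid)
print(len(LUT), max_err)
```

A narrower kernel with more centres lowers the error at the cost of a larger table, which mirrors the LUT size versus accuracy tension the abstract describes.
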
### 3.6 Mechanisms for hardware fault testing, recovery and metastability management

**Date**: Tuesday 28 March 2017  
**Time**: 14:30 - 16:00  
**Location / Room**: 5A

**Chair**:  
Jaume Abella, Barcelona Supercomputing Center (BSC), ES

**Co-Chair**:  
Maria K. Michael, University of Cyprus, CY

Papers in this session provide new solutions for dealing with hardware faults and metastability issues, including testing and diagnosing mechanisms for NoCs, fault recovery approaches for 3D ICs, and containment solutions for metastability in sorting networks.

### 3.6.1 CHARKA: A RELIABILITY-AWARE TEST SCHEME FOR DIAGNOSIS OF CHANNEL SHORTS BEYOND MESH NOCS

**Time:** 14:30  
**Speaker:** Santosh Biswas, IIT Guwahati, IN  
**Authors:** Biswajit Bhowmik, Jatindra Kumar Deka and Santosh Biswas, IIT Guwahati, IN  
**Abstract**  
This paper presents a fast and low-cost on-line scheme named "Charka" that analyzes short faults in channels of octagon NoCs. Experimental results demonstrate that the proposed scheme achieves 100% coverage metrics, and its on-line evaluation reveals a compelling effect of these faults on system performance. We observe that the proposed scheme is up to 9X faster, while packet latency is improved by 13.79-21.17% and energy consumption is reduced by 17.57-24.97%. Further, the test area overhead is reduced by 13-26%, which shows a 52-57.77% improvement.  
Download Paper (PDF; Only available from the DATE venue WiFi)

### 3.6.2 RECOVERY-AWARE PROACTIVE TSV REPAIR FOR ELECTROMIGRATION IN 3D ICs

**Time:** 15:00  
**Speaker:** Shengcheng Wang, Chair of Dependable Nano Computing (CDNC), Karlsruhe Institute of Technology (KIT), DE  
**Authors:** Shengcheng Wang 1, Hengyang Zhao 2, Sheldon Tan 3 and Mehdi Tahoori 1  
1Karlsruhe Institute of Technology, DE; 2University of California, Riverside, US; 3University of California at Riverside, US  
**Abstract**  
Electromigration (EM) has become a major reliability concern in three-dimensional integrated circuits (3D ICs). To mitigate this problem, a typical solution is to use TSV redundancy in a reactive manner, maintaining the operability of a 3D chip in the presence of EM failures by detecting and replacing faulty TSVs with spares. In this work, we explore an alternative, preferable approach to enhance the EM-related lifetime reliability of the TSV grid, in which redundancy is used proactively to allow non-faulty TSVs to be temporarily deactivated. In this way, EM wear-out can be reversed by exploiting its recovery property. Applied to 3D benchmark designs, the recovery-aware proactive repair approach increases the EM-related lifetime reliability (measured in mean-time-to-failure) of the entire TSV grid by up to 12X relative to the conventional reactive method, with less area overhead.  
Download Paper (PDF; Only available from the DATE venue WiFi)

### 3.6.3 NEAR-OPTIMAL METASTABILITY-CONTAINING SORTING NETWORKS

**Time:** 15:30  
**Speaker:** Johannes Bund, Saarland University, DE  
**Authors:** Johannes Bund 1, Christoph Lenzen 2 and Moti Medina 2  
1Saarland University, DE; 2MPI-INF, DE  
**Abstract**  
Metastability in digital circuits is a spurious mode of operation induced by violation of setup/hold times of stateful components. It cannot be avoided deterministically when transitioning from continuously-valued to (discrete) binary signals. However, in prior work (Lenzen & Medina ASYNC 2016) it has been shown that it is possible to fully and deterministically contain the effect of metastability in sorting networks. More specifically, the sorting operation incurs no loss of precision, i.e., any inaccuracy of the output originates from mapping the continuous input range to a finite domain. The downside of this prior result is inefficiency: for B-bit inputs, the circuit for a single comparison contains Theta(B^2) gates and has depth Theta(log B). In this work, we present an improved solution with near-optimal Theta(B log B) gates and asymptotically optimal Theta(log B) depth. On the practical side, our sorting networks improve over prior work for all input lengths B > 2, e.g., for 16-bit inputs we present an improvement of more than 70% w.r.t. the depth of the sorting network and more than 60% improvement w.r.t. the cost of the sorting network.  
Download Paper (PDF; Only available from the DATE venue WiFi)
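
The notion of containing metastability in 3.6.3 can be illustrated with worst-case ternary (Kleene) logic, in which a possibly metastable signal is modelled as a third value `M`. The sketch below is only an illustration of that abstraction and is not the comparison circuit from the paper; the gate helpers and the names `mux_naive` and `mux_containing` are assumptions made for this example. It shows a plain multiplexer propagating `M` from the select input even when both data inputs agree, and a variant with an extra `a AND b` term that masks it.

```python
M = "M"  # hypothetical marker for a metastable (unknown) signal value


def t_not(a):
    return M if a == M else 1 - a


def t_and(a, b):
    if a == 0 or b == 0:
        return 0  # a stable 0 masks a metastable input
    if a == 1 and b == 1:
        return 1
    return M


def t_or(a, b):
    if a == 1 or b == 1:
        return 1  # a stable 1 masks a metastable input
    if a == 0 and b == 0:
        return 0
    return M


def mux_naive(s, a, b):
    # Selects a when s == 0 and b when s == 1.
    return t_or(t_and(a, t_not(s)), t_and(b, s))


def mux_containing(s, a, b):
    # Same function on stable inputs, plus the term (a AND b),
    # which keeps the output stable when a == b == 1 and s is metastable.
    return t_or(mux_naive(s, a, b), t_and(a, b))
```

With `s = M` and `a = b = 1`, `mux_naive` returns `M` while `mux_containing` returns `1`; on stable inputs the two functions agree.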
### 3DFAR: A THREE-DIMENSIONAL FABRIC FOR RELIABLE MULTICORE PROCESSORS

**Speaker:** Valeria Bertacco, University of Michigan, US  
**Authors:** Javad Bagherzadeh and Valeria Bertacco, University of Michigan, US  

**Abstract**  
In the past decade, silicon technology trends into the nanometer regime have led to significantly higher transistor failure rates. Moreover, these trends are expected to worsen with future devices. To enhance reliability, several approaches leverage the inherent core-level and processor-level redundancy present in large chip multiprocessors. However, all of these methods incur high overheads, making them impractical. In this paper, we propose 3DFAR, a novel architecture leveraging three-dimensional fabric layouts to efficiently enhance reliability in the presence of faults. Our key idea is based on a fine-grained reconfigurable pipeline for multicore processors, which minimizes routing delay among spare units of the same type by using physical layout locality and efficient interconnect switches, distributed over multiple vertical layers. Our evaluation shows that 3DFAR outperforms state-of-the-art reliable 2D solutions, at a minimal area cost of only 7% over an unprotected design.

Download Paper (PDF; Only available from the DATE venue WiFi)

---

### EVALUATING IMPACT OF HUMAN ERRORS ON THE AVAILABILITY OF DATA STORAGE SYSTEMS

**Speaker:** Hossein Asadi, Sharif University of Technology, IR  
**Authors:** Mostafa Kishani, Reza Eftekhari and Hossein Asadi, Sharif University of Technology, IR  

**Abstract**  
In this paper, we investigate the effect of incorrect disk replacement service on the availability of data storage systems. To this end, we first conduct Monte Carlo simulations to evaluate the availability of the disk subsystem by considering disk failures and incorrect disk replacement service. We also propose a Markov model that corroborates the Monte Carlo simulation results. We further extend the proposed model to consider the effect of an automatic disk fail-over policy. The results obtained by the proposed model show that overlooking the impact of incorrect disk replacement can result in underestimating unavailability by up to three orders of magnitude. Moreover, this study suggests that, by considering the effect of human errors, the conventional beliefs about the dependability of different RAID mechanisms should be revised. The results show that in the presence of human errors, RAID1 can result in lower availability compared to RAID5.

Download Paper (PDF; Only available from the DATE venue WiFi)
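
To make the Markov-model idea in the abstract above concrete, here is a minimal sketch of a three-state continuous-time chain for a mirrored disk pair. It is not the model from the paper: the state layout, the parameter names (`lam`, `mu`, `mu_rec`, `p_err`) and the rates used in the example are all assumptions chosen for illustration. State 0 means both disks are healthy, state 1 means one disk has failed, and state 2 means data is unavailable, either because the second disk failed or because the wrong disk was pulled during replacement.

```python
def steady_state_unavailability(lam, mu, mu_rec, p_err):
    """Steady-state probability of the 'unavailable' state.

    lam    : failure rate of one disk (per hour)
    mu     : disk replacement rate (per hour)
    mu_rec : recovery rate from the unavailable state (per hour)
    p_err  : probability that a replacement is performed incorrectly
    All four parameters are hypothetical inputs for this sketch.
    """
    # Balance equations, expressed relative to pi0 = 1:
    #   state 1: pi1 * (mu + lam) = pi0 * 2 * lam
    #   state 2: pi2 * mu_rec     = pi1 * (lam + mu * p_err)
    pi0 = 1.0
    pi1 = 2.0 * lam * pi0 / (mu + lam)
    pi2 = pi1 * (lam + mu * p_err) / mu_rec
    return pi2 / (pi0 + pi1 + pi2)


# Made-up rates: ignoring human error versus a 1% wrong-replacement probability.
without_error = steady_state_unavailability(1e-5, 0.1, 0.05, 0.0)
with_error = steady_state_unavailability(1e-5, 0.1, 0.05, 0.01)
```

With these invented rates, the model that ignores human error underestimates unavailability by roughly two orders of magnitude; the figures reported in the paper come from its own model and parameters, not from this sketch.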

---

### 3.7 Scheduling and Optimization

**Date:** Tuesday 28 March 2017  
**Time:** 14:30 - 16:00  
**Location / Room:** 3B  

**Chair:** Rolf Ernst, TU Braunschweig, DE  
**Co-Chair:** Kai Lampka, Uppsala University, SE  

This session focuses on methods to optimize the design of real-time embedded systems. The first two presentations cover priority assignment and task partitioning for scheduling on multi-core systems. The last long presentation and interactive presentations focus on architectural and OS considerations.

### 3.7.1 THE CONCEPT OF UNSCHEDULABILITY CORE FOR OPTIMIZING PRIORITY ASSIGNMENT IN REAL-TIME SYSTEMS

**Time:** 14:30  
**Speaker:** Yecheng Zhao, Virginia Polytechnic Institute and State University, US  
**Authors:** Yecheng Zhao and Haibo Zeng, Virginia Tech, US

### 3.7.2 Utilization Difference Based Partitioned Scheduling of Mixed-Criticality Systems

**Speaker:** Saravanan Ramanathan, Nanyang Technological University, SG  
**Authors:** Saravanan Ramanathan and Arvind Easwaran, Nanyang Technological University, SG  
**Abstract**  
Mixed-Criticality (MC) systems consolidate multiple functionalities with different criticalities onto a single hardware platform. Such systems improve the overall resource utilization while guaranteeing resources to critical tasks. In this paper, we focus on the problem of partitioned multiprocessor MC scheduling, in particular the problem of designing efficient partitioning strategies. We develop two new partitioning strategies based on the principle of evenly distributing the difference between total high-critical utilization and total low-critical utilization for the critical tasks among all processors. By balancing this difference, we are able to reduce the pessimism in uniprocessor MC schedulability tests that are applied on each processor, thus improving overall schedulability. To evaluate the schedulability performance of the proposed strategies, we compare them against existing partitioned algorithms using extensive experiments. We show that the proposed strategies are effective with both dynamic-priority Earliest Deadline First with Virtual Deadlines (EDF-VD) and fixed-priority Adaptive Mixed-Criticality (AMC) algorithms. Specifically, our results show that the proposed strategies improve schedulability by as much as 28.1% and 36.2% for implicit and constrained-deadline task systems respectively.  
Download Paper (PDF; Only available from the DATE venue WiFi)
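
The balancing principle described in the abstract above can be sketched as a greedy heuristic. This is a simplified illustration under stated assumptions, not either of the two strategies from the paper: it only places high-criticality tasks, it performs no schedulability test, and the function name and the `(name, u_lo, u_hi)` task format are hypothetical.

```python
def partition_by_utilization_difference(tasks, num_procs):
    """Greedy sketch: spread (u_hi - u_lo) evenly over processors.

    tasks: list of (name, u_lo, u_hi) tuples for high-criticality tasks,
           where u_lo / u_hi are the low- and high-criticality utilizations.
    Returns one dict per processor with its task names and summed difference.
    """
    bins = [{"tasks": [], "diff": 0.0} for _ in range(num_procs)]
    # Place the largest differences first, each on the processor whose
    # accumulated difference is currently the smallest.
    ordered = sorted(tasks, key=lambda t: t[2] - t[1], reverse=True)
    for name, u_lo, u_hi in ordered:
        target = min(bins, key=lambda b: b["diff"])
        target["tasks"].append(name)
        target["diff"] += u_hi - u_lo
    return bins


example = [("a", 0.1, 0.5), ("b", 0.1, 0.4), ("c", 0.2, 0.3), ("d", 0.1, 0.2)]
result = partition_by_utilization_difference(example, 2)
```

A complete partitioning strategy would also allocate low-criticality tasks and check each processor with a uniprocessor test such as EDF-VD or AMC, as the abstract describes.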

### 3.7.3 Schedulability Using Native Non-Preemptive Groups on an AUTOSAR/OSEK Platform with Caches

**Speaker:** Leo Hatvani, Technische Universität Eindhoven, NL  
**Authors:** Leo Hatvani 1, Reinder J. Bril 1 and Sebastian Altmeyer 2  
1Technische Universität Eindhoven (TU/e), NL; 2University of Amsterdam (UvA), NL  
**Abstract**  
Fixed-priority preemption threshold scheduling (FPTS) is a limited preemptive scheduling scheme that generalizes both fixed-priority preemptive scheduling (FPPS) and fixed-priority non-preemptive scheduling (FPNS). By increasing the priority of tasks as they start executing, it reduces the set of tasks that can preempt any given task. A subset of FPTS task configurations can be implemented natively on any AUTOSAR/OSEK compatible platform by utilizing the platform’s native implementation of non-preemptive task groups via so-called internal resources. The limiting factor for this implementation is the number of internal resources that can be associated with any individual task. OSEK and consequently AUTOSAR limit this number to one internal resource per task. In this work, we investigate the impact of this limitation on the schedulability of task sets when cache-related preemption delays are taken into account. We also consider the impact of this restriction on the stack size when the tasks are executed on a shared-stack system.  
Download Paper (PDF; Only available from the DATE venue WiFi)
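
The preemption rule behind FPTS can be shown in a few lines: a task can preempt a running task only if its priority exceeds the running task's preemption threshold. The sketch below assumes that a larger number means a higher priority; the function name and the dictionary layout are hypothetical, and cache-related preemption delays are not modelled.

```python
def preemption_sets(tasks):
    """tasks: dict mapping name -> (priority, threshold).

    Returns, for each task, the sorted names of tasks that may preempt it.
    Larger numbers denote higher priority (an assumption of this sketch).
    """
    result = {}
    for name, (_, threshold) in tasks.items():
        result[name] = sorted(
            other
            for other, (priority, _) in tasks.items()
            if other != name and priority > threshold
        )
    return result


# threshold == priority for every task behaves like FPPS
fpps = {"low": (1, 1), "mid": (2, 2), "high": (3, 3)}
# raising the threshold of "low" to 2 stops "mid" from preempting it (FPTS)
fpts = {"low": (1, 2), "mid": (2, 2), "high": (3, 3)}
# threshold == highest priority for every task behaves like FPNS
fpns = {"low": (1, 3), "mid": (2, 3), "high": (3, 3)}
```

Comparing `preemption_sets(fpps)` with `preemption_sets(fpts)` shows how a raised threshold shrinks the set of possible preempters for a task.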

### 3.8 Addressing Challenges in Today’s Datacenter Systems’ Design

#### 16:00 IP1-18

**Speaker:** Björn Forsberg, ETH Zürich, CH  
**Authors:** Björn Forsberg 1, Andrea Marongiu 2 and Luca Benini 3  
1ETH Zürich, CH; 2Swiss Federal Institute of Technology in Zurich (ETHZ), CH; 3Università di Bologna, IT  
**Abstract**  
The deployment of real-time workloads on commercial off-the-shelf (COTS) hardware is attractive, as it reduces the cost and time-to-market of new products. Most modern high-end embedded SoCs rely on a heterogeneous design, coupling a general-purpose multi-core CPU to a massively parallel accelerator, typically a programmable GPU, sharing a single global DRAM. However, because of non-predictable hardware arbiters designed to maximize average or peak performance, it is very difficult to provide timing guarantees on such systems. In this paper, we present our ongoing work on GPUguard, a software technique that predictably arbitrates main memory usage in heterogeneous SoCs. A prototype implementation for the NVIDIA Tegra TX1 SoC shows that GPUguard is able to reduce the adverse effects of memory sharing, while retaining a high throughput on both the CPU and the accelerator.  
Download Paper (PDF; Only available from the DATE venue WiFi)

#### 16:01 IP1-19

**Speaker:** Lin Li, Infineon Technologies, DE  
**Authors:** Lin Li 1, Philipp Wagner 2, Albrecht Mayer 1, Thomas Wild 2 and Andreas Herkersdorf 3  
1Infineon Technologies, DE; 2Technical University of Munich, DE; 3TU München, DE  
**Abstract**  
Locks are widely used as a synchronization method to guarantee mutual exclusion for accesses to shared resources in multi-core embedded systems. They have been studied for years to improve performance, fairness, and predictability, and a variety of lock implementations optimized for different scenarios have been proposed. In practice, applying an appropriate lock type to a specific scenario is usually based on the developer’s hypothesis, which may not match the actual situation. Applying the wrong lock type may result in lower performance and unfairness. Thus, a lock profiling tool is needed to increase system transparency and guarantee proper lock usage. In this paper, an operating-system-independent lock profiling approach is proposed, as there are many different operating systems in the embedded field. This approach detects lock acquisition and lock release using hardware tracing based on hardware-level spinlock characteristics instead of specific libraries or APIs. The spinlocks are identified automatically; lock profiling statistics can be measured and performance-harmful lock behaviors are detected. With this information, the software developer can improve the lock usage. A prototype was implemented as a Java tool to conduct hardware tracing and analyze locks inside applications running on Infineon AURIX microcontrollers.  
Download Paper (PDF; Only available from the DATE venue WiFi)
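
As a rough illustration of the kind of statistics a lock profiler can derive once acquisitions and releases have been detected, the sketch below computes per-lock wait and hold times from a list of events. The event format `(time, core, event, lock)` and the event names `"try"`, `"acquire"` and `"release"` are hypothetical; they do not represent the AURIX trace format or the tool described in the abstract.

```python
from collections import defaultdict


def profile_locks(trace):
    """Aggregate per-lock statistics from a hypothetical event trace.

    trace: iterable of (time, core, event, lock) tuples, where event is
           one of "try", "acquire" or "release".
    """
    first_try = {}    # (core, lock) -> time of the first failed attempt
    acquired_at = {}  # (core, lock) -> time the lock was taken
    stats = defaultdict(lambda: {"acquisitions": 0, "wait": 0, "hold": 0})

    # Sort by timestamp only; the sort is stable, so same-time events
    # keep their original order.
    for time, core, event, lock in sorted(trace, key=lambda e: e[0]):
        key = (core, lock)
        if event == "try":
            first_try.setdefault(key, time)  # keep the first attempt while spinning
        elif event == "acquire":
            stats[lock]["acquisitions"] += 1
            stats[lock]["wait"] += time - first_try.pop(key, time)
            acquired_at[key] = time
        elif event == "release" and key in acquired_at:
            stats[lock]["hold"] += time - acquired_at.pop(key)
    return dict(stats)


example_trace = [
    (0, 0, "try", "L"), (1, 0, "acquire", "L"),
    (2, 1, "try", "L"), (5, 0, "release", "L"),
    (6, 1, "acquire", "L"), (8, 1, "release", "L"),
]
example_stats = profile_locks(example_trace)
```

A long accumulated wait time relative to hold time would be one simple indicator of contention that a developer might act on.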

#### 16:00 End of session

**Coffee Break** in Exhibition Area  
On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

- **Tuesday, March 28, 2017**  
  - Coffee Break 10:30 - 11:30  
  - Coffee Break 16:00 - 17:00

- **Wednesday, March 29, 2017**  
  - Coffee Break 10:00 - 11:00  
  - Coffee Break 16:00 - 17:00

- **Thursday, March 30, 2017**  
  - Coffee Break 10:00 - 11:00  
  - Coffee Break 15:30 - 16:00
### 3.8.1 SERVER BENCHMARKING AND DESIGN WITH CLOUDSUITE 3.0

**Speaker:** Javier Picorel, EPFL, CH

**Abstract**

Since its inception, CloudSuite (cloudsuite.ch) has emerged as a popular suite of benchmarks both in industry and among academics for the performance evaluation of cloud services. The EuroCloud Server project blueprinted key optimizations in server SoCs based on the salient features of CloudSuite benchmarks that lead to an order of magnitude improvement in efficiency while preserving QoS. ARM-based server products (e.g., Cavium ThunderX) have now emerged following these guidelines and showcasing the improved efficiency. CloudSuite 3.0 is a major enhancement over prior releases both in benchmarks and infrastructure. It includes benchmarks that represent massive data manipulation with tight latency constraints such as in-memory data analytics using Apache Spark, a new real-time video streaming benchmark following today’s most popular video-sharing website setups, and a new web serving benchmark mirroring today’s multi-tier web server software stacks. To ease the deployment of CloudSuite into private and public cloud systems, the benchmarks are integrated into the Docker software container system and Google’s PerfKit Benchmark. Docker wraps each benchmark into a self-contained software package, guaranteeing the same execution regardless of the environment, while PerfKit automates the process of benchmarking cloud server systems with CloudSuite. CloudSuite 3.0 is supported to run both on real hardware and on our QEMU-based computer architecture simulation framework.

### 3.8.2 PROTECTING DATA IN FARM AND RDMA NETWORKS WITH CATAPULT

**Speaker:** Greg O’Shea, Microsoft, US

**Abstract**

FaRM is an in-memory, transactional database that runs distributed across a cluster of Windows Servers that are connected by a high-speed Remote Direct Memory Access (RDMA) network. Data in FaRM are stored in DRAM and exposed directly to the L2 network by the server’s RDMA network adapters, so that other members of the FaRM cluster can access the data with great efficiency. RDMA enables a network adapter to directly access the memory of another server in the same Ethernet network bypassing the operating system in both servers. This enables low-latency and high-bandwidth data access across the entire cluster. However, RDMA provides no security: the data are also accessible to every other server attached to the same Ethernet network, and message transfers are vulnerable to replay and modification. We present our work to protect data in FaRM using a bump-in-the-wire firewall for RDMA. Based upon the FPGA cards widely deployed in Windows Servers within Microsoft, the firewall exists as a barrier between a FaRM server’s RDMA adapter and the local Ethernet switch. It prevents packets from outside the FaRM cluster from ever reaching the server’s RDMA adapter, and it protects RDMA packets between members of the FaRM cluster by encapsulating them in DTLS tunnels. We show that implementing a similar level of protection in software can be prohibitively expensive.

### 3.9 A tribute to Ralph Otten

**Date:** Tuesday 28 March 2017

**Time:** 14:30 - 16:00

**Location / Room:** Auditorium A

**Organiser:** Giovanni De Micheli, EPFL, CH

**Chair:** Michael Burstein, CEO Billy.com, CA

**Co-Chair:** Giovanni De Micheli, EPFL, CH

**Ralph Otten**  
World-renowned leaders in Physical Design will talk about accomplishments in this field over the last four decades, as a tribute to Ralph Otten, a pioneer of this field who died prematurely in an accident.

### 3.9.1 CHIP DESIGN - PHYSICAL AND PHILOSOPHICAL

**Author:** Dave Liu, NTHU, TW

### 3.9.2 AUTOMATIC FLOORPLAN DESIGN

**Author:** Martin Wong, University of Illinois at Urbana Champaign, US

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>15:00</td>
<td>3.9.3</td>
<td><strong>THE EVOLUTION OF FLOORPLANNING</strong></td>
<td>Antun Domic, Synopsys, US</td>
</tr>
<tr>
<td>15:15</td>
<td>3.9.4</td>
<td><strong>FROM SILICON COMPILER TO PHYSICAL SYNTHESIS: RALPH OTTEN'S CONTRIBUTIONS TO EDA</strong></td>
<td>Patrick Groeneveld, Synopsys, US</td>
</tr>
<tr>
<td>15:30</td>
<td>3.9.5</td>
<td><strong>DEALING WITH EXPLODING DESIGN RULE NUMBERS AND COMPLEXITY</strong></td>
<td>Raul Camposano, Sage Design Automation, US</td>
</tr>
<tr>
<td>15:45</td>
<td>3.9.6</td>
<td><strong>IN MEMORIAM OF RALPH OTTEN: BREAKING DOWN THE COMPLEXITY OF LAYOUT DESIGN UNDER MOORE'S LAW</strong></td>
<td>Jochen Jess, Eindhoven University of Technology, NL</td>
</tr>
<tr>
<td>16:00</td>
<td></td>
<td>End of session</td>
<td></td>
</tr>
</tbody>
</table>

### UB03 Session 3

**Date:** Tuesday 28 March 2017  
**Time:** 15:00 - 17:30  
**Location / Room:** Booth 1, Exhibition Area

---

**UB03.1**  
**WORKCRAFT: TOOLSET FOR FORMAL SPECIFICATION, SYNTHESIS AND VERIFICATION OF CONCURRENT SYSTEMS**  
**Presenter:** Danil Sokolov, Newcastle University, GB  
**Abstract**

A large number of models that are employed in the field of concurrent systems' design, such as Petri nets, gate-level circuits, and dataflow structures, have an underlying static graph structure. Their semantics, however, is defined using additional entities, e.g. tokens or node/arc states, which collectively form the overall state of the system. We jointly refer to such formalisms as interpreted graph models. This demo will show the use of the open-source, cross-platform Workcraft framework for capturing, simulation, synthesis, and verification of such models. The focus of our case study will be on synthesis from technology-independent formal specifications to verifiable circuit implementations.

More information ...
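
The idea of a static graph whose state is carried by tokens can be illustrated with the standard Petri net firing rule. The sketch below is independent of Workcraft and does not use its API; the dictionary-based representation and the function names are assumptions made for this example.

```python
def enabled(marking, pre):
    """A transition is enabled if every input place holds enough tokens."""
    return all(marking.get(place, 0) >= count for place, count in pre.items())


def fire(marking, pre, post):
    """Return the marking reached by firing a transition.

    marking: dict place -> token count (the dynamic state)
    pre/post: dicts place -> arc weight (the static graph structure)
    """
    if not enabled(marking, pre):
        raise ValueError("transition not enabled")
    new_marking = dict(marking)
    for place, count in pre.items():
        new_marking[place] = new_marking.get(place, 0) - count  # consume tokens
    for place, count in post.items():
        new_marking[place] = new_marking.get(place, 0) + count  # produce tokens
    return new_marking


# One transition moving a token from place "p1" to place "p2".
start = {"p1": 1}
after = fire(start, pre={"p1": 1}, post={"p2": 1})
```

The graph (`pre` and `post`) never changes; only the marking does, which is the distinction the abstract draws between structure and state.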

**UB03.2**  
**RIMEDIO: WHEELCHAIR MOUNTED ROBOTIC ARM DEMONSTRATOR FOR PEOPLE WITH MOTOR SKILLS IMPAIRMENTS**  
**Presenter:** Alessandro Palla, University of Pisa, IT  
**Authors:** Gabriele Meoni and Luca Fanucci, University of Pisa, IT  
**Abstract**

People with reduced mobility experience many issues when interacting with indoor and outdoor environments because of their disability. For those users, even the simplest action might be a hard or impossible task to perform without the assistance of an external aid. We propose a simple and lightweight wheelchair-mounted robotic arm with a focus on the human-machine interface, which has to be simple and accessible for users with different kinds of disabilities. The robotic arm is equipped with a 5 MP camera, force and proximity sensors and a 6-axis Inertial Measurement Unit on the end-effector, and can be controlled using an app running on a tablet. When the user selects the object to reach (for instance a button) on the tablet screen, the arm autonomously carries out the task, using the camera image and the sensor measurements for autonomous navigation. The demonstrator consists of the robotic arm prototype, the Android tablet and a personal computer for arm setup and configuration.

More information ...

**UB03.3**  
**FLEXPORT: FLEXIBLE PLATFORM FOR OBJECT RECOGNITION & TRACKING TO ENHANCE INDOOR LOCALIZATION AND MAPPING**  
**Presenter:** Marko Rößler, Technische Universität Chemnitz, DE  
**Authors:** Christian Schott, Murali Padmanabha and Ulrich Heinkel, TU Chemnitz, DE  
**Abstract**

Object detection plays a crucial role in realizing intelligent indoor localization and mapping techniques. With the advantages of these techniques comes the complexity of computing hardware and mobility. While the availability of open-source computer vision algorithms and High-Level Synthesis frameworks accelerates development, the hybrid processing architecture of an All Programmable System on Chip (APSoC) enables efficient hardware-software partitioning. Using these tools, a generic platform was designed for evaluating computer vision algorithms. Open-source components such as the Linux kernel and OpenCV libraries were integrated for evaluation of the algorithms in software, while the Vivado HLS framework was used to synthesize the hardware counterparts. Algorithms such as Sobel filtering and Hough line transformation were implemented and analyzed. The capabilities of this platform were used to realize a mobile object detection system for enhancing localization techniques.

More information ...
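
For readers unfamiliar with the Sobel filter mentioned in the abstract above, here is a small reference sketch in plain Python. It is not the OpenCV or Vivado HLS implementation used in the demonstrator; the function name and the use of the `|gx| + |gy|` magnitude approximation are choices made for this example.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Approximate gradient magnitude |gx| + |gy| of a 2D list of pixels.

    Border pixels are left at 0 because the 3x3 window does not fit there.
    """
    height, width = len(img), len(img[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for j in range(3):
                for i in range(3):
                    pixel = img[y + j - 1][x + i - 1]
                    gx += SOBEL_X[j][i] * pixel
                    gy += SOBEL_Y[j][i] * pixel
            out[y][x] = abs(gx) + abs(gy)
    return out


# A 4x4 image with a vertical edge between columns 1 and 2.
edge_image = [[0, 0, 10, 10] for _ in range(4)]
edges = sobel_magnitude(edge_image)
```

The multiply-accumulate over a fixed 3x3 window is the part that maps naturally onto hardware, which is why filters of this kind are common candidates for hardware-software partitioning.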

**MATISSE: A TARGET-AWARE COMPILER TO TRANSLATE MATLAB INTO C AND OPENCL**  
**Presenter:** Luís Reis, University of Porto, PT  
**Authors:** João Bispo and João Cardoso, University of Porto / INESC-TEC, PT  
**Abstract**

Many engineering, scientific and finance algorithms are prototyped and validated in array languages, such as MATLAB, before being converted to other languages such as C for use in production. As such, there has been substantial effort to develop compilers to perform this translation automatically. Alternative types of computation devices, such as GPGPUs and FPGAs, are becoming increasingly more popular, so it becomes critical to develop compilers that target these architectures. We have adapted MATISSE, our MATLAB-compatible compiler framework, to generate C and OpenCL code for these platforms. In this demonstration, we will show how our compiler works and what its capabilities are. We will also describe the main challenges of efficient code generation from MATLAB and how to overcome them.

More information ...

**A VOLTAGE-SCALABLE FULLY DIGITAL ON-CHIP MEMORY FOR ULTRA-LOW-POWER IOT PROCESSORS**  
**Presenter:** Jun Shiomi, Kyoto University, JP  
**Authors:** Tohru Ishihara and Hidetoshi Onodera, Kyoto University, JP  
**Abstract**

A voltage-scalable RISC processor integrating standard-cell based memory (SCM) is demonstrated. Unlike conventional processors, it uses Standard-Cell based Memories (SCMs) as an alternative to conventional SRAM macros, enabling it to operate at a single supply voltage of 0.4 V. The processor is implemented with a fully automated cell-based design flow, which leads to low design costs. By scaling the supply voltage and applying back-gate biasing techniques, the power dissipation of the SCMs is kept below 20 uW, enabling the SCMs to operate from an ambient energy source alone. In this demonstration, the SCMs of the processor operate with a lemon battery as the ambient energy source.

More information ...

**RUNNING CONVOLUTIONAL LAYERS OF ALEXNET IN NEUROMORPHIC COMPUTING SYSTEM**  
**Presenter:** Yongshin Kang, Incheon National University, KR  
**Authors:** Seban Kim, Taehwan Shin and Jaeyong Chung, Incheon National University, KR  
**Abstract**

Neuromorphic hardware has drawn attention as an approach to deal with the issues of today’s computing platforms based on the von Neumann architecture when running deep learning models, but large-scale deep neural networks such as AlexNet have not yet been demonstrated in any neuromorphic system. Since 2014, we have been developing a non-von Neumann computing system called INSight, based on a dataflow architecture, that aims at running large-scale deep neural networks in a neuromorphic fashion. We have now reached a major milestone and will demonstrate INSight running the convolutional layers of AlexNet. The proposed system is implemented on a Xilinx Virtex 7 FPGA and performs the processing using 100K synapses mapped onto LUTs without any array-type memories. It processes 1552 images per second and consumes 7.2 W, resulting in state-of-the-art energy efficiency.

More information ...
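
The energy-efficiency figure implied by the two numbers in the abstract above can be worked out directly. The short calculation below assumes that 7.2 W is the average power drawn while sustaining 1552 images per second; the variable names are chosen for this example.

```python
images_per_second = 1552
power_watts = 7.2  # assumed to be average power at the stated throughput

# Energy per image in millijoules: P / throughput, converted from J to mJ.
energy_per_image_mj = power_watts / images_per_second * 1000.0

# Equivalent efficiency expressed as images processed per joule.
images_per_joule = images_per_second / power_watts
```

This gives roughly 4.6 mJ per image, or about 216 images per joule, for the convolutional layers only.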

**ACCELERATORS: RECONFIGURABLE SELF-TIMED DATAFLOW ACCELERATOR & FAST NETWORK ANALYSIS IN SILICON**  
**Presenter:** Alessandro de Gennaro, Newcastle University, GB  
**Authors:** Danil Sokolov and Andrey Mokhov, Newcastle University, GB

More information ...

**NETWORKED LABS-ON-CHIPS**  
**Presenter:** Andreas Grimmer, Johannes Kepler University Linz, AT  
**Authors:** Werner Haselmayr, Andreas Springer and Robert Wille, Johannes Kepler University Linz, AT  
**Abstract**

Labs-on-Chip (LoC) allow for the miniaturization, integration, and automation of medical and bio-chemical procedures. In recent years, different technologies have been considered. However, all of them have their drawbacks, e.g. electrowetting-based LoCs suffer from the evaporation of liquids, the fast degradation of the surface coatings, and inferior biocompatibility, while flow-based LoCs require a complex and costly multilayer fabrication process. Hence, an alternative has recently been proposed in the form of Networked Labs-on-Chips (NLoCs). We present and demonstrate the NLoC technology, in which so-called droplets flow inside channels of micrometer size. Networking functionalities enable the designer to dynamically select the operations to be conducted. These networking functionalities exploit hydrodynamic forces acting on droplets. Moreover, NLoC devices can be produced at low cost (e.g. using 3D printers). In this way, the drawbacks of established LoC technologies are addressed.

More information ...

**STACKADROP: A MODULAR DIGITAL MICROFLUIDIC BIOCHIP RESEARCH PLATFORM**  
**Presenter:** Oliver Keszöcze, University of Bremen, DE  
**Authors:** Maximilian Luenert and Rolf Drechsler, University of Bremen & DFKI GmbH, DE  
**Abstract**

Advances in microfluidic technologies have led to the emergence of Digital Microfluidic Biochips (DMFBs), which are capable of automating laboratory procedures. These DMFBs have attracted significant attention in industry and academia, creating a demand for devices. Commercial products are available but come at a high price. So far, there are two open hardware DMFBs available: the DropBot from WheelerLabs and the OpenDrop from GaudiLabs. The aim of the StackADrop was to create a DMFB with many directly addressable cells while still being very compact. The StackADrop strives to provide means to experiment with different hardware setups. Its main feature is the exchangeable top plates, supporting 256 high-voltage pins. It features SPI, UART and I2C connectors for attaching sensors/actuators and can be connected to a computer using USB for interactive sessions using control software. The modularity makes it easy to test different cell shapes, such as squares, hexagons and triangles.

More information ...

**PULP: AN ULTRA-LOW POWER PLATFORM FOR THE INTERNET-OF-THINGS**  
**Presenter:** Francesco Conti, ETH Zurich, CH  
**Authors:** Stefan Mach 1, Florian Zaruba 1, Antonio Pullini 1, Daniele Palossi 1, Giovanni Rovere 1, Florian Glaser 1, Germain Haugou 1, Schekeb Fateh 1 and Luca Benini 2  
1ETH Zurich, CH; 2ETH Zurich, CH and University of Bologna, IT  
**Abstract**

The PULP (Parallel Ultra-Low Power) platform strives to provide high performance for IoT nodes and endpoints within a very small power envelope. The PULP platform is based on a tightly-coupled multi-core cluster and on a modular architecture, which can support complex configurations with autonomous I/O without SW intervention, HW-accelerated execution of hot computation kernels, and fine-grain event-based computation, but can also be deployed in a very simple configuration, such as the open-source PULPino microcontroller. In this demonstration booth, we will showcase several prototypes using PULP chips in various configurations. Our prototypes perform demos such as real-time deep-learning based visual recognition from a low-power camera, and online biosignal acquisition and reconstruction on the same chip. Application scenarios for our technology include healthcare wearables, autonomous nano-UAVs, and smart networked environmental sensors.

More information ...

### IP1 Interactive Presentations

**Date:** Tuesday 28 March 2017  
**Time:** 16:00 - 16:30  
**Location / Room:** IP sessions (in front of rooms 4A and 5A)

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated with each IP paper is on display throughout the afternoon. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding short session, prior to the actual Interactive Presentation. At the end of each afternoon Interactive Presentations session, the ‘Best IP of the Day’ award is given.

<table>
<thead>
<tr>
<th>Label</th>
<th>Presentation Title</th>
</tr>
</thead>
<tbody>
<tr>
<td>IP1-1</td>
<td>STRUCTURAL DESIGN OPTIMIZATION FOR DEEP CONVOLUTIONAL NEURAL NETWORKS USING STOCHASTIC COMPUTING</td>
</tr>
<tr>
<td>Speaker</td>
<td>Yanzhi Wang, Syracuse University, US</td>
</tr>
<tr>
<td>Authors</td>
<td>Zhe Li1, Ji Li2, Qinru Qiu1, Bo Yuan3, Jeffrey Draper2 and Yanzhi Wang1<br>1Syracuse University, US; 2University of Southern California, US; 3City University of New York, New York, City College, US</td>
</tr>
<tr>
<td>Abstract</td>
<td>Deep Convolutional Neural Networks (DCNNs) have been demonstrated as effective models for understanding image content. The computation behind DCNNs relies heavily on the capability of hardware resources due to the deep structure. DCNNs have been implemented on different large-scale computing platforms. However, there is a trend towards embedding DCNNs into light-weight local systems, which requires low power/energy consumption and small hardware footprints. Stochastic Computing (SC) radically simplifies the hardware implementation of arithmetic units and has the potential to satisfy the small, low-power needs of DCNNs. Local connectivities and down-sampling operations have made DCNNs more complex to implement using SC. In this paper, eight feature extraction designs for DCNNs using SC in two groups are explored and optimized in detail from the perspective of calculation precision, where we permute two SC implementations for inner-product calculation, two down-sampling schemes, and two structures of DCNN neurons. We evaluate the network in terms of network accuracy and hardware performance for each DCNN using one feature extraction design out of eight. Through exploration and optimization, the accuracies of SC-based DCNNs are guaranteed compared with software implementations on CPU/GPU/binary-based ASIC synthesis, while area, power, and energy are significantly reduced by up to 776X.</td>
</tr>
<tr>
<td>IP1-2</td>
<td>APPROXQA: A UNIFIED QUALITY ASSURANCE FRAMEWORK FOR APPROXIMATE COMPUTING</td>
</tr>
<tr>
<td>Speaker</td>
<td>Ting Wang, The Chinese University of Hong Kong, HK</td>
</tr>
<tr>
<td>Authors</td>
<td>Ting Wang, Qian Zhang and Qiang Xu, The Chinese University of Hong Kong, HK</td>
</tr>
<tr>
<td>Abstract</td>
<td>Approximate computing, which trades off computation quality against computational effort (e.g., energy) by exploiting the inherent error-resilience of emerging applications (e.g., recognition and mining), has garnered significant attention recently. Quality assurance is undoubtedly indispensable for a satisfactory user experience with approximate computing, but this issue has remained largely unexplored in the literature. In this work, we propose a novel framework named ApproxQA to tackle this problem, in which approximation mode tuning and rollback recovery are considered in a unified manner when quality variation occurs. Specifically, ApproxQA resorts to a two-level controller, in which the high-level approximation controller tunes approximation modes at a coarse-grained scale based on Q-learning while the low-level rollback controller judiciously determines whether to perform rollback recovery at a fine-grained scale based on the target quality requirement. ApproxQA can provide statistical quality assurance even when the underlying quality checkers are not reliable. Experimental results on various benchmark applications demonstrate that it significantly outperforms existing solutions in terms of energy efficiency with quality assurance. Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
<tr>
<td>IP1-3</td>
<td>EVOAPPROX8B: LIBRARY OF APPROXIMATE ADDERS AND MULTIPLIERS FOR CIRCUIT DESIGN AND BENCHMARKING OF APPROXIMATION METHODS</td>
</tr>
<tr>
<td>Speaker</td>
<td>Lukas Sekanina, Brno University of Technology, CZ</td>
</tr>
<tr>
<td>Authors</td>
<td>Vojtech Mrazek, Radek Hrbacek, Zdenek Vasicek and Lukas Sekanina, Brno University of Technology, CZ</td>
</tr>
<tr>
<td>Abstract</td>
<td>Approximate circuits and approximate circuit design methodologies have attracted significant attention from researchers as well as industry in recent years. In order to accelerate the approximate circuit and system design process and to support fair benchmarking of circuit approximation methods, we propose a library of approximate adders and multipliers called EvoApprox8b. This library contains 430 non-dominated 8-bit approximate adders created from 13 conventional adders and 471 non-dominated 8-bit approximate multipliers created from 6 conventional multipliers. These implementations were evolved by multi-objective Cartesian genetic programming. The EvoApprox8b library provides Verilog, Matlab and C models of all approximate circuits. In addition to standard circuit parameters, the error is given for seven different error metrics. The EvoApprox8b library is available at: <a href="http://www.fit.vutbr.cz/research/groups/ehw/approxlib">www.fit.vutbr.cz/research/groups/ehw/approxlib</a></td>
</tr>
<tr>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
<td></td>
</tr>
</tbody>
</table>
Abstract

Spin-Transfer Torque magnetic Random Access Memory (STT-RAM) is one of the emerging technologies in the domain of dense non-volatile memories and is especially favored for the last-level cache (LLC). The current presently needed to reorient the magnetization (~100 μA per bit) is too high, especially for the write operation. When a full cache line (512 bits) is written, this current, which is extremely high compared to MRAM, results in a voltage drop in the conventional cache architecture. Due to this drop, the write operation fails halfway through when we attempt to write to the bank of the cache farthest from the supply. In this paper, we propose a new cache architecture that mitigates this droop problem and makes the write operation succeed. Instead of writing the entire cache line (512 bits) contiguously into a single bank, our architecture writes the 512 bits to multiple different locations across the cache in 8 parts (64 bits each). Circuit-level and micro-architectural simulation results comparing the proposed architecture against the conventional one are presented in detail.

Download Paper (PDF; Only available from the DATE venue WiFi)
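
The striping idea in the abstract above can be illustrated with a minimal Python sketch. It models only the address arithmetic: a 512-bit line is cut into eight 64-bit parts and each part is sent to a different bank. The bank count, the rotation-based `bank_for_chunk` mapping, and the dictionary-backed banks are assumptions made for illustration and are not taken from the paper.

```python
LINE_BITS = 512
CHUNK_BITS = 64
NUM_CHUNKS = LINE_BITS // CHUNK_BITS  # 8 parts of 64 bits each


def split_line(line):
    """Split a 512-bit cache line (held in an int) into eight 64-bit chunks, lowest chunk first."""
    mask = (1 << CHUNK_BITS) - 1
    return [(line >> (i * CHUNK_BITS)) & mask for i in range(NUM_CHUNKS)]


def bank_for_chunk(line_addr, chunk_idx, num_banks=8):
    """Hypothetical mapping: rotate the chunks of a line across the banks."""
    return (line_addr + chunk_idx) % num_banks


def write_line(banks, line_addr, line):
    """Write each 64-bit chunk of the line into its own bank."""
    for idx, chunk in enumerate(split_line(line)):
        bank = banks[bank_for_chunk(line_addr, idx, len(banks))]
        bank[(line_addr, idx)] = chunk


def read_line(banks, line_addr):
    """Gather the eight chunks from their banks and reassemble the line."""
    line = 0
    for idx in range(NUM_CHUNKS):
        bank = banks[bank_for_chunk(line_addr, idx, len(banks))]
        line |= bank[(line_addr, idx)] << (idx * CHUNK_BITS)
    return line


banks = [dict() for _ in range(8)]        # eight banks, each a simple key/value store
data = (1 << 511) | 0xDEADBEEF            # an arbitrary 512-bit pattern
write_line(banks, 5, data)
print(read_line(banks, 5) == data)        # round trip succeeds
print([len(b) for b in banks])            # one 64-bit chunk per bank
```

With eight banks, no bank receives more than 64 bits of any one line, which is the property the abstract relies on to limit the per-bank write current.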

**SECURECLOUD: SECURE BIG DATA PROCESSING IN UNTRUSTED CLOUDS**

*Speaker:* Rafael Pires, University of Neuchâtel, CH

*Abstract*

We present the SecureCloud EU Horizon 2020 project, whose goal is to enable new big data applications that use sensitive data in the cloud without compromising data security and privacy. For this, SecureCloud designs and develops a layered architecture that allows for (i) the secure creation and deployment of secure microservices; (ii) the secure integration of individual microservices into full-fledged big data applications; and (iii) the secure execution of these applications within untrusted cloud environments. To provide security guarantees, SecureCloud leverages novel security mechanisms present in recent commodity CPUs, in particular, Intel's Software Guard Extensions (SGX). SecureCloud applies this architecture to big data applications in the context of smart grids. We describe the SecureCloud approach, initial results, and considered use cases.

Download Paper (PDF; Only available from the DATE venue WiFi)

**WCET-AWARE PARALLELIZATION OF MODEL-BASED APPLICATIONS FOR MULTI-CORES: THE ARGO APPROACH**

*Speaker:* Steven Derrien, Université de Rennes 1, FR

*Authors:*

Steven Derrien1, Isabelle Puaut2, Panayiotis Alefragis3, Marcus Bednara4, Harald Bucher5, Clément David6, Yann Debray7, Umut Durak7, Imen Fassi2, Christian Ferdinand8, Damien Hardy7, Angeliki Kritikakou7, Gerard Rauwerda9, Simon Reder5, Martin Sicks6, Timo Stripf9, Kim Sunesen5, Timon ter Braak9, Nikolaos Voros3 and Jürgen Becker9

1IRISA, FR; 2University of Rennes 1 / IRISA, FR; 3TGW, GR; 4Fraunhofer IIS, DE; 5Karlsruhe Institute of Technology, DE; 6Scilab, FR; 7DLR, DE; 8AbsInt, FR; 9Recore Systems, FR

*Abstract*

Parallel architectures are no longer confined to the domain of high-performance computing; they are also increasingly used in embedded time-critical systems. The ARGO H2020 project provides a programming paradigm and associated tool flow to exploit the full potential of architectures in terms of development productivity, time-to-market, exploitation of the platform computing power and guaranteed real-time performance. In this paper we give an overview of the objectives of ARGO and explore the challenges introduced by our approach.

Download Paper (PDF; Only available from the DATE venue WiFi)

**EXPLORING THE UNKNOWN THROUGH SUCCESSIVE GENERATIONS OF LOW POWER AND LOW RESOURCE VERSATILE AGENTS**

*Speaker:* Martin Andraud, Eindhoven University of Technology, NL

*Authors:*

Martin Andraud1 and Marian Verhelst2

1Eindhoven University of Technology, NL; 2Katholieke Universiteit Leuven, BE

*Abstract*

The Phoenix project aims to develop a new approach to explore unknown environments, based on multiple measurement campaigns carried out by extremely tiny devices, called agents, that gather data from multiple sensors. These low-power and low-resource agents are configured specifically for each measurement campaign to achieve the exploration goal in the smallest number of iterations. Thus, the main design challenge is to build agents that are as reconfigurable as possible. This paper introduces the Phoenix project in more detail and presents first developments in the agent design.

Download Paper (PDF; Only available from the DATE venue WiFi)

**POWER PROFILING OF MICROCONTROLLER'S INSTRUCTION SET FOR RUNTIME HARDWARE TROJANS DETECTION WITHOUT GOLDEN CIRCUIT MODELS**

*Speaker:* Falah Awwad, College of Engineering / Department of Electrical Engineering, UAE University, AE

*Authors:*

Falah Awwad1, Syed Rafay Hasan2, Osman Hasan2 and Falah Awwad1

1School of Electrical Engineering and Computer Science National University of Sciences and Technology (NUST), PK; 2Department of Electrical and Computer Engineering, Tennessee Technological University, US; 3College of Engineering, United Arab Emirates University, AE

*Abstract*

Globalization trends in integrated circuit (IC) design are leading to increased vulnerability of ICs to hardware Trojans (HT). Recently, several techniques based on side-channel parameters have been developed to detect these hardware Trojans, but they require a golden circuit as a reference model; due to the widespread usage of IP, most systems-on-chip (SoCs) do not have a golden reference IP. In this work, we present a methodology to extract the power profile of the micro-controller's instruction set, which is in turn used to train a machine learning algorithm. In this technique, the power profile is obtained by extracting the power behavior of the micro-controller for different assembly language instructions. This trained model is then embedded into the integrated circuits at the SoC integration level, where it classifies the power profile during runtime to detect intrusions. We applied our proposed technique to the MCB8051 micro-controller in VHDL, obtained the power profile of its instruction set and then applied deep learning, k-NN, decision tree and naive Bayesian based machine learning tools to train the models. The cross-validation comparison of these learning algorithms, when applied to MCB8051 Trojan benchmarks, shows that we can achieve 87% to 99% accuracy. To the best of our knowledge, this is the first work in which the power profile of a microprocessor's instruction set is used in conjunction with machine learning for runtime HT detection.

Download Paper (PDF; Only available from the DATE venue WiFi)

**ACCOUNTING FOR SYSTEMATIC ERRORS IN APPROXIMATE COMPUTING**

*Speaker:* Martin Bruestel, Technical University Dresden, DE

*Authors:*

Martin Bruestel1 and Akash Kumar2

1Technical University Dresden, DE; 2Technische Universität Dresden, DE

*Abstract*

Approximate computing is gaining more and more attention as a potential solution to the problem of increasing energy demand in computing. Several recent works focus on the application of deterministic approximate computing to arithmetic computations. Circuits for addition and multiplication are simplified, trading exactness for energy and/or speed. Recent approximation techniques for adders focus on modifications of individual full adders' truth tables or shortening carry chains. While the resulting error is usually characterized with statistical measures over the range of possible input/output combinations, the actual adder is a static nonlinear system regarding arithmetic operations and signal processing. The resulting unexpected effects present a challenge for adopting approximate computing as a widespread and standard application-level optimization technique. This paper focuses on the deterministic effects of approximate multi-bit adders, which are especially evident for certain input data in otherwise well-specified systems, showing the necessity to look beyond purely statistical measures. We show which fundamental principles are violated depending on the chosen approximation scheme, and how this choice affects practical applications. This can serve as a basis for designers to make informed decisions about the use of approximate adders at the application level.

Download Paper (PDF; Only available from the DATE venue WiFi)
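
To make the notion of a deterministic, input-dependent error concrete, here is a small Python sketch of a simplified lower-part-OR adder. This adder is chosen purely as a familiar example of shortening the carry chain; it is not claimed to be one of the schemes analysed in the paper. The function name `loa_add`, the split point `k`, and the absence of any carry from the lower to the upper part are assumptions of this sketch.

```python
def loa_add(a, b, k=4):
    """Simplified lower-part-OR adder.

    The k low bits are combined with a bitwise OR instead of an addition;
    the remaining upper bits are added exactly. No carry crosses from the
    lower part into the upper part (a simplification made for this sketch).
    """
    mask = (1 << k) - 1
    low = (a | b) & mask                   # approximate lower part
    high = ((a >> k) + (b >> k)) << k      # exact upper part
    return high | low


print(loa_add(16, 16))  # 32: the lower parts are zero, so the result is exact
print(loa_add(8, 7))    # 15: the operands share no set bits, so OR equals the sum
print(loa_add(3, 3))    # 3: equal low parts give x | x = x, where the exact sum is 6
print(loa_add(8, 8))    # 8: smaller than loa_add(8, 7) although the input grew
```

The last two calls show errors that are systematic rather than random: identical operands always lose their lower-part sum, and the result is not monotonic in its inputs. An average error figure over all input pairs would not reveal either effect.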


[PI-14] **GAUSSIAN MIXTURE ERROR ESTIMATION FOR APPROXIMATE CIRCUITS**

**Speaker:** Amin Ghasemazar, The University of British Columbia, CA  
**Authors:** Amin Ghasemazar and Mieszko Lis, University of British Columbia, CA  
**Abstract**  
In application domains where perceived quality is limited by human senses, where data are inherently noisy, or where models are naturally inexact, approximate computing offers an attractive tradeoff between accuracy and energy or performance. While several approximate functional units have been proposed to date, the question of how these techniques can be systematically integrated into a design flow remains open. Ideally, units like adders or multipliers could be automatically replaced with their approximate counterparts as part of the design flow. This, however, requires accurately modeling approximation errors to avoid compromising output quality. Prior proposals have either focused on describing errors per-bit or significantly limited estimation accuracy to reduce otherwise exponential storage requirements. When multiple approximate modules are chained, these limitations become critical, and propagated error estimates can be orders of magnitude off. In this paper, we propose an approach where both input distributions and approximation errors are modeled as Gaussian mixtures. This naturally represents the multiple sources of error that arise in many approximate circuits while maintaining reasonable memory requirements. Estimation accuracy is significantly better than prior art (up to 7.2× lower Hellinger distance) and errors can be accurately propagated through a cascade of approximate operations; estimates of quality metrics like MSE and MED are within a few percent of simulation-derived values.  
Download Paper (PDF; Only available from the DATE venue WiFi)
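
One reason Gaussian mixtures suit cascaded operations is that the sum of two independent Gaussian-mixture variables is again a Gaussian mixture: every pair of components contributes one component whose weight is the product of the weights and whose mean and variance are the sums. The Python sketch below shows only this standard property for an exact value plus an additive error term. The tuple layout `(weight, mean, variance)`, the helper names, and the numbers are illustrative assumptions; the paper's actual propagation rules are not described in the abstract.

```python
from itertools import product


def add_mixtures(x, y):
    """Mixture of X + Y for independent X and Y, each a list of (weight, mean, variance)."""
    return [(wx * wy, mx + my, vx + vy)
            for (wx, mx, vx), (wy, my, vy) in product(x, y)]


def mixture_mean(m):
    """Mean of a Gaussian mixture."""
    return sum(w * mu for w, mu, _ in m)


def mixture_second_moment(m):
    """E[Z^2] of a Gaussian mixture; for an error mixture this is its mean squared error."""
    return sum(w * (var + mu * mu) for w, mu, var in m)


# Hypothetical input distribution: two equally likely operating points.
signal = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
# Hypothetical error model: usually small noise, occasionally a large negative error.
error = [(0.9, 0.0, 0.25), (0.1, -8.0, 0.25)]

output = add_mixtures(signal, error)
print(len(output))                      # 4 components, one per component pair
print(mixture_mean(output))             # mean of signal plus mean of error
print(mixture_second_moment(error))     # mean squared error of the error model
```

The number of components multiplies at every stage, so a practical flow would need some form of component merging or pruning to keep memory bounded. How that is done is outside the scope of this sketch.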

[PI-15] **ENHANCING SYMBOLIC SYSTEM SYNTHESIS THROUGH ASPMT WITH PARTIAL ASSIGNMENT EVALUATION**

**Speaker:** Kai Neubauer, University of Rostock, DE  
**Authors:** Kai Neubauer1, Philipp Wanko2, Torsten Schaub2 and Christian Haubelt1  
1University of Rostock, DE; 2University of Potsdam, DE  
**Abstract**  
The design of embedded systems is becoming continuously more complex such that efficient design methods are becoming crucial for competitive results regarding design time and performance. Recently, combined Answer Set Programming (ASP) and Quantifier Free Integer Difference Logic (QF-IDL) solving has been shown to be a promising approach in system synthesis. However, this approach still has several restrictions limiting its applicability. In the paper at hand, we propose a novel ASP modulo theories (ASPMT) system synthesis approach, which (i) supports more sophisticated system models, (ii) tightly integrates the QF-IDL solving into the ASP solving, and (iii) makes use of partial assignment checking. As a result, more realistic systems are considered and an early exclusion of infeasible solutions improves the entire system synthesis.  
Download Paper (PDF; Only available from the DATE venue WiFi)

[PI-16] **3DFAR: A THREE-DIMENSIONAL FABRIC FOR RELIABLE MULTICORE PROCESSORS**

**Speaker:** Valeria Bertacco, University of Michigan, US  
**Authors:** Javad Bagherzadeh and Valeria Bertacco, University of Michigan, US  
**Abstract**  
In the past decade, silicon technology trends into the nanometer regime have led to significantly higher transistor failure rates. Moreover, these trends are expected to worsen with future devices. To enhance reliability, several approaches leverage the inherent core-level and processor-level redundancy present in large chip multiprocessors. However, all of these methods incur high overheads, making them impractical. In this paper, we propose 3DFAR, a novel architecture leveraging 3-dimensional fabric layouts to efficiently enhance reliability in the presence of faults. Our key idea is based on a fine-grained reconfigurable pipeline for multicore processors, which minimizes routing delay among spare units of the same type by using physical layout locality and efficient interconnect switches, distributed over multiple vertical layers. Our evaluation shows that 3DFAR outperforms state-of-the-art reliable 2D solutions, at a minimal area cost of only 7% over an unprotected design.  
Download Paper (PDF; Only available from the DATE venue WiFi)

[PI-17] **EVALUATING IMPACT OF HUMAN ERRORS ON THE AVAILABILITY OF DATA STORAGE SYSTEMS**

**Speaker:** Hossein Asadi, Sharif University of Technology, IR  
**Authors:** Mostafa Kishani, Reza Eftekhari and Hossein Asadi, Sharif University of Technology, IR  
**Abstract**  
In this paper, we investigate the effect of incorrect disk replacement service on the availability of data storage systems. To this end, we first conduct Monte Carlo simulations to evaluate the availability of the disk subsystem by considering disk failures and incorrect disk replacement service. We also propose a Markov model that corroborates the Monte Carlo simulation results. We further extend the proposed model to consider the effect of an automatic disk fail-over policy. The results obtained by the proposed model show that overlooking the impact of incorrect disk replacement can lead to underestimating unavailability by up to three orders of magnitude. Moreover, this study suggests that by considering the effect of human errors, the conventional beliefs about the dependability of different RAID mechanisms should be revised. The results show that in the presence of human errors, RAID1 can result in lower availability compared to RAID5.  
Download Paper (PDF; Only available from the DATE venue WiFi)
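
The kind of Monte Carlo experiment mentioned in the abstract can be sketched for a toy case. The Python code below simulates a mirrored disk pair as a three-state continuous-time Markov chain (`ok`, `degraded`, `down`) in which a replacement pulls the wrong, still-working disk with probability `p_wrong`. The state set, the exponential timing, and every rate used here are hypothetical choices for illustration; they are not the authors' model or parameters.

```python
import random


def simulate_availability(fail_rate, replace_rate, recover_rate,
                          p_wrong, horizon, seed=0):
    """Fraction of time a mirrored disk pair serves data (rates per hour, horizon in hours)."""
    rng = random.Random(seed)
    t, up_time, state = 0.0, 0.0, "ok"
    while True:
        if state == "ok":
            # Either of the two disks may fail.
            dwell = rng.expovariate(2 * fail_rate)
            nxt = "degraded"
        elif state == "degraded":
            # Race between a second disk failure and the replacement service.
            total = fail_rate + replace_rate
            dwell = rng.expovariate(total)
            if rng.random() < fail_rate / total:
                nxt = "down"      # second disk failed first
            elif rng.random() < p_wrong:
                nxt = "down"      # human error: the healthy disk was pulled
            else:
                nxt = "ok"        # correct replacement
        else:
            # Data unavailable until recovery completes.
            dwell = rng.expovariate(recover_rate)
            nxt = "ok"
        remaining = horizon - t
        if dwell >= remaining:
            # Truncate the last interval at the simulation horizon.
            if state != "down":
                up_time += remaining
            return up_time / horizon
        if state != "down":
            up_time += dwell
        t += dwell
        state = nxt


no_error = simulate_availability(1e-4, 0.1, 1 / 24, 0.0, 1e7, seed=1)
with_error = simulate_availability(1e-4, 0.1, 1 / 24, 0.1, 1e7, seed=1)
print(1 - no_error, 1 - with_error)   # unavailability without and with human error
```

Comparing the two printed unavailability values gives a feel for how strongly a human-error probability can dominate the result when double disk failures are rare.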

[PI-18] **GPUGUARD: TOWARDS SUPPORTING A PREDICTABLE EXECUTION MODEL FOR HETEROGENEOUS SOC**

**Speaker:** Björn Forsberg, ETH Zürich, CH  
**Authors:** Björn Forsberg1, Andrea Marongiu2 and Luca Benini3  
1ETH Zürich, CH; 2Swiss Federal Institute of Technology in Zurich (ETHZ), CH; 3Università di Bologna, IT  
**Abstract**  
The deployment of real-time workloads on commercial off-the-shelf (COTS) hardware is attractive, as it reduces the cost and time-to-market of new products. Most modern high-end embedded SoCs rely on a heterogeneous design, coupling a general-purpose multi-core CPU to a massively parallel accelerator, typically a programmable GPU, sharing a single global DRAM. However, because of non-predictable hardware arbiters designed to maximize average or peak performance, it is very difficult to provide timing guarantees on such systems. In this work we present our ongoing work on GPUguard, a software technique that predictably arbitrates main memory usage in heterogeneous SoCs. A prototype implementation for the NVIDIA Tegra TX1 SoC shows that GPUguard is able to reduce the adverse effects of memory sharing, while retaining a high throughput on both the CPU and the accelerator.  
Download Paper (PDF; Only available from the DATE venue WiFi)

**A NON-INTRUSIVE, OPERATING SYSTEM INDEPENDENT SPINLOCK PROFILER FOR EMBEDDED MULTICORE SYSTEMS**

**Speaker:**
Lin Li, Infineon Technologies, DE

**Authors:**
Lin Li¹, Philipp Wagner², Albrecht Mayer¹, Thomas Wild² and Andreas Herkersdorf³

¹Infineon Technologies, DE; ²Technical University of Munich, DE; ³TU München, DE

**Abstract**
Locks are widely used as a synchronization method to guarantee the mutual exclusion for accesses to shared resources in multi-core embedded systems. They have been studied for years to improve performance, fairness, predictability etc. and a variety of lock implementations optimized for different scenarios have been proposed. In practice, applying an appropriate lock type to a specific scenario is usually based on the developer's hypothesis, which could mismatch the actual situation. A wrong lock type applied may result in lower performance and unfairness. Thus, a lock profiling tool is needed to increase the system transparency and guarantee the proper lock usage. In this paper, an operating-system-independent lock profiling approach is proposed as there are many different operating systems in the embedded field. This approach detects lock acquisition and lock releasing using hardware tracing based on hardware-level spinlock characteristics instead of specific libraries or APIs.

**Download Paper (PDF; Only available from the DATE venue WiFi)**

---

**4.1 IT&A Session: The Emergence of Silicon Photonics: From High Performance Computing to Data Centers and Quantum Computing**

**Date:**
Tuesday 28 March 2017

**Time:**
17:00 - 18:30

**Location / Room:**
SBC

**Organiser:**
Luca Carloni, Columbia University, US

**Chair:**
Luca Carloni, Columbia University, US

Recent years have seen major progress in the design and manufacturing of silicon photonics devices. This session provides an overview of the potential that this emerging technology offers for three different types of system and discusses the most important challenges that remain to be addressed. The first talk shows how silicon photonics components can be used to realize energy-efficient high-bandwidth optical interconnection networks. The second talk presents the further advances in manufacturing, packaging and testing that are needed in order to realize silicon photonics based products for data centers. Finally, the last talk explains how the generation of optical quantum states on an integrated platform can enable future practical implementations of quantum information processing systems.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>17:00</td>
<td>4.1.1</td>
<td>ENERGY-PERFORMANCE OPTIMIZED DESIGN OF SILICON PHOTONIC INTERCONNECTION NETWORKS FOR HIGH-PERFORMANCE COMPUTING</td>
<td>Keren Bergman, Columbia University, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Speaker:</strong></td>
<td>Keren Bergman, Columbia University, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Authors:</strong></td>
<td>Meisam Bahadori¹, Sebastien Rumley¹, Robert Polster¹, Alexander Gazman¹, Matt Traverso², Mark Webster², Kaushik Patel² and Keren Bergman¹</td>
</tr>
<tr>
<td></td>
<td></td>
<td>¹Columbia University, US; ²Cisco Systems, US</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>We present detailed electrical and optical models of the elements that comprise a WDM silicon photonic link. The electronics is assumed to be based on a 65 nm CMOS node, and the optical modulators and demultiplexers are based on microring resonators. The goal of this study is to analyze the energy consumption and scalability of the link by finding the right combination of (number of channels × data rate per channel) that fully covers the available optical power budget. Based on the set of empirical and analytical models presented in this work, a maximum capacity of 0.75 Tbps can be envisioned for a point-to-point link with an energy consumption of 1.9 pJ/bit. Sub-pJ/bit energy consumption is also predicted for aggregated bitrates up to 0.35 Tbps.</td>
</tr>
<tr>
<td>17:30</td>
<td>4.1.2</td>
<td>RAPID GROWTH OF IP TRAFFIC IS DRIVING ADOPTION OF SILICON PHOTONICS IN DATA CENTERS</td>
<td>Kaushik Patel, Cisco Systems, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Speaker and Author:</strong></td>
<td>Kaushik Patel, Cisco Systems, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>With the dramatic growth in consumers using Mobile plus Video data and the corresponding increase in IP traffic, more Data Centers are required together with a need to scale the capacity within the Data Centers. Moore’s law continues to push advances in CMOS technology enabling the design of larger higher capacity ASICs used to build Switches and Routers in the Data Centers. The cost, power dissipation and face plate optical density challenges are being solved by Silicon Photonics deployed in smaller form factor pluggable optics with a longer term transition to embedded optics. This march towards higher data rates, lower cost and lower power dissipation requires major advances in the cost, volume wafer manufacturing, optical packaging and test for Silicon Photonics based products. The focus of this talk will be on how Cisco is addressing these multiple development and manufacturing challenges as Silicon Photonics based products are released in the market.</td>
</tr>
<tr>
<td>18:00</td>
<td>4.1.3</td>
<td>GENERATION OF COMPLEX QUANTUM STATES VIA INTEGRATED FREQUENCY COMBS</td>
<td>Roberto Morandotti, INRS-EMT, CA</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Speaker:</strong></td>
<td>Roberto Morandotti, INRS-EMT, CA</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Authors:</strong></td>
<td>Christian Reimer¹, Michael Kues², Piotr Roztocki², Benjamin Wetzel³, Brent E. Little⁴, Sai T. Chu⁵, Luca Caspani⁶, David J. Moss⁷ and Roberto Morandotti¹</td>
</tr>
<tr>
<td></td>
<td></td>
<td>¹INRS-EMT, CA; ²INRS-EMT &amp; University of Glasgow, CA; ³INRS-EMT &amp; University of Sussex, CA; ⁴Xi'an Institute of Optics and Precision Mechanics, CN; ⁵City University of Hong Kong, CN; ⁶University of Strathclyde, GB; ⁷Swinburne University of Technology, AU</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>The generation of optical quantum states on an integrated platform will enable low cost and accessible advances for quantum technologies such as secure communications and quantum computation. We demonstrate that integrated quantum frequency combs (based on high-Q microring resonators made from a CMOS-compatible, high-refractive-index glass platform) can enable, among others, the generation of heralded single photons, cross-polarized photon pairs, as well as bi- and multi-photon entangled qubit states over a broad frequency comb covering the S, C, L telecommunications band, constituting an important cornerstone for future practical implementations of photonic quantum information processing.</td>
</tr>
</tbody>
</table>

**Download Paper (PDF; Only available from the DATE venue WiFi)**

---

**4.2 Logic, Interconnects, Neurons: New Realizations**

**Date:**
Tuesday 28 March 2017

**Time:**
17:00 - 18:30

**Location / Room:**
4BC

*The Exhibition Reception will take place on Tuesday in the exhibition area, where free drinks for all conference delegates and exhibition visitors will be offered. All exhibitors are welcome to also provide drinks and snacks for the attendees.*

This session covers papers showing new approaches to realizing optimized logic circuits using silicon nanowire reconfigurable transistors; intra- and inter-core optoelectronic interconnects for energy-efficient communications; and magnetic skyrmions as a novel nanoelectronic device for non-linear neuron networks.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>17:00</td>
<td>4.2.1</td>
<td>EXPLOITING TRANSISTOR-LEVEL RECONFIGURATION TO OPTIMIZE COMBINATIONAL CIRCUITS</td>
<td>Michael Raitza, Technische Universität Dresden, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Michael Raitza¹, Jens Trommer², Akash Kumar³, Marcus Völz⁴, Dennis Walter⁵, Walter Weber⁶ and Thomas Mikolajick⁷</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>¹Technische Universität Dresden and CFAED, DE; ²NamLab gGmbH, DE; ³Technische Universität Dresden, DE; ⁴SNT University of Luxembourg, LU; ⁵Technische Universität Dresden, DE; ⁶NamLab gGmbH and CFAED, DE; ⁷NamLab GmbH / TU Dresden, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td>Silicon nanowire reconfigurable field effect transistors (SiNW RFETs) abolish the physical separation of n-type and p-type transistors by taking up both roles in a configurable way within a doping-free technology. However, the potential of transistor-level reconfigurability has not been demonstrated in larger circuits, so far. In this paper, we present first steps to a new compact and efficient design of combinational circuits by employing transistor-level reconfiguration. We contribute new basic gates realized with silicon nanowires, such as 2/3-XOR and MUX gates. Exemplifying our approach with 4-bit, 8-bit and 16-bit conditional carry adders, we were able to reduce the number of transistors to almost one half. With our current case study we show that SiNW technology can reduce the required chip area by 16%, despite larger size of the individual transistor, and improve circuit speed by 26%.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>17:30</td>
<td>4.2.2</td>
<td>AUTOMATIC PLACE-AND-ROUTE OF EMERGING LED-DRIVEN WIRES WITHIN A MONOLITHICALLY-INTEGRATED CMOS+III-V PROCESS</td>
<td>Tushar Krishna, Georgia Institute of Technology, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Tushar Krishna¹, Anu Balachandran², Siau Ben Chiah³, Li Zhang⁴, Beng Wang⁵, Cong Wang⁶, Kenneth Lee Eng Kian⁷, Jurgen Michel⁸ and Li-Shiuan Peh⁹</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>¹Georgia Institute of Technology, US; ²JNTU, SG; ³SMART, SG; ⁴MIT, US; ⁵Professor, National University of Singapore, SG</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td>We leverage a recently demonstrated CMOS compatible III-V and Si monolithic integrated process to design photonic links comprising LEDs and photodiodes, as direct replacements for on-chip electrical wires. To enable VLSI-scale design of chips with such LED links, we create a library of opto-electronic standard cells, and model waveguides as traditional metal layers. This lets us integrate LED links into a commercial place-and-route tool, which treats them as electrical cells and wires for the most part, reducing design effort. We also add support for automated replacement of electrical nets with LED links. We find that LED-interconnect based designs substantially lower energy consumption vs. electrical copper wires (~39% reduction in the Network-on-Chip, ~27% reduction within a processor core) while achieving the same latency and bandwidth, demonstrating the promise of LED on-chip interconnects.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>18:00</td>
<td>4.2.3</td>
<td>A TUNABLE MAGNETIC SKYRMION NEURON CLUSTER FOR ENERGY EFFICIENT ARTIFICIAL NEURAL NETWORK</td>
<td>Deliang Fan, University of Central Florida, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Zhezhi He¹ and Deliang Fan²</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>¹Department of ECE, University of Central Florida, US; ²University of Central Florida, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td>The artificial neuron is one of the fundamental computing units in brain-inspired artificial neural networks. Standard CMOS-based artificial neuron designs that implement a non-linear neuron activation function typically consist of a large number of transistors, which inevitably causes large area and power consumption. There is a need for a novel nanoelectronic device that can intrinsically and efficiently implement such a complex non-linear neuron activation function. Magnetic skyrmions are topologically stable chiral spin textures due to the Dzyaloshinskii-Moriya interaction in bulk magnets or magnetic thin films. They are promising next-generation information carriers owing to their ultra-small size (sub-10nm), high speed (&gt;10^10/s), ultra-low depinning current density (MA/cm²) and high defect tolerance compared to conventional magnetic domain wall motion devices. In this work, to the best of our knowledge, we are the first to propose a threshold-tunable artificial neuron based on magnetic skyrmions. In addition, we propose a Skyrmion Neuron Cluster (SNC) to approximate non-linear soft-limiting neuron activation functions, such as the popular sigmoid function. The device-to-system simulation indicates that our proposed SNC leads to 98.74% recognition accuracy in a deep learning Convolutional Neural Network (CNN) with the MNIST handwritten digits dataset. Moreover, the energy consumption of our proposed SNC is only 3.1 fJ/step, which is more than two orders of magnitude lower than that of its CMOS counterpart.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>18:30</td>
<td>IF2-1</td>
<td>COMPACT MODELING AND CIRCUIT-LEVEL SIMULATION OF SILICON NANOPHOTONIC INTERCONNECTS</td>
<td>Yuyang Wang, UC Santa Barbara, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Rui Wu, Yuyang Wang, Zeyu Zhang, Chong Zhang, Clint Schow, John Bowers and Kwang-Ting Cheng, UC Santa Barbara, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td>Nanophotonic interconnects have been playing an increasingly important role in the datacom regime. Greater integration of silicon photonics demands modeling and simulation support for design validation, optimization and design space exploration. In this work, we develop compact models for a number of key photonic devices, which are extensively validated by the measurement data of a fabricated optical network-on-chip (ONoC). Implemented in SPICE-compatible Verilog-A, the models are used in circuit-level simulations of full optical links. The simulation results match well with the measurement data. Our model library and simulation approach enable the electro-optical (EO) co-simulation, allowing designers to include photonic devices in the whole system design space, and to co-optimize the transmitter, interconnect, and receiver jointly.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Download Paper (PDF; Only available from the DATE venue WiFi)
A TRUE RANDOM NUMBER GENERATOR BASED ON PARALLEL STT-MTJS

Speaker:
Yuanzhuo Qu, University of Alberta, CA
Authors:
Yuanzhuo Qu1, Jie Han1, Bruce Cockburn2, Yue Zhang1, Weisheng Zhao1 and Witold Pedrycz1
1University of Alberta, CA; 2Beihang University, CN

Abstract
Random number generators are an essential part of cryptographic systems. For the highest level of security, true random number generators (TRNG) are needed instead of pseudo-random number generators. In this paper, the stochastic behavior of the spin transfer torque magnetic tunnel junction (STT-MTJ) is utilized to produce a TRNG design. A parallel structure with multiple MTJs is proposed that minimizes device variation effects. The design is validated in a 28-nm CMOS process with Monte Carlo simulation using a compact model of the MTJ. The National Institute of Standards and Technology (NIST) statistical test suite is used to verify the randomness quality when generating encryption keys for the Transport Layer Security or Secure Sockets Layer (TLS/SSL) cryptographic protocol. This design has a generation speed of 177.8 Mbit/s, and an energy of 0.64 pJ is consumed to set up the state in one MTJ.
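The randomness check mentioned in the abstract can be illustrated with the simplest test of the NIST SP 800-22 suite, the frequency (monobit) test. The sketch below is only an illustration of that one test, not the authors' evaluation flow; the function names and the 0.01 significance level (the default suggested by NIST) are assumptions of this example.

```python
import math


def monobit_p_value(bits):
    """Frequency (monobit) test of NIST SP 800-22; returns the p-value."""
    n = len(bits)
    if n == 0:
        raise ValueError("need at least one bit")
    # Map 0 -> -1 and 1 -> +1 and sum: a balanced stream sums to about 0.
    s_n = sum(1 if b else -1 for b in bits)
    s_obs = abs(s_n) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def passes_monobit(bits, alpha=0.01):
    """A stream passes when its p-value is at least the significance level."""
    return monobit_p_value(bits) >= alpha
```

A perfectly balanced stream such as `[1, 0] * 500` passes, while a stream of 1000 ones fails. Passing this single test is necessary but far from sufficient; the full suite applies many further tests to much longer bit streams.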

RECONFIGURABLE THRESHOLD LOGIC GATES USING OPTOELECTRONIC CAPACITORS

Speaker:
Baris Taskin, Drexel University, US
Authors:
Ragh Kuttappa, Lunal Khuon, Bahram Nabet and Baris Taskin, Drexel University, US

Abstract
This paper investigates the integration of optoelectronic devices with CMOS threshold logic gates to design reconfigurable Boolean functions. The weight of the optoelectronic device can be altered by changing the optical power, which is used to reconfigure the threshold logic (TL) gate. The proposed optoelectronic capacitor based TL (OECTL) gates are designed for i) simple AND/NAND gates and OR/NOR gates with large fan-in and ii) linearly separable Boolean functions that can be reconfigured to other linearly separable Boolean functions, constrained in reconfiguration by the specifics of TL operation. SPICE simulations in 65nm bulk CMOS technology with a Verilog-A model for the optoelectronic capacitor demonstrate that i) AND/NAND gates and OR/NOR gates are 2X faster as fan-in increases and consume low power, and ii) Boolean functions can be reconfigured with 0.58X smaller delay and 0.46X lower power than standard CMOS.
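As background for the abstract above, a threshold logic gate outputs 1 when the weighted sum of its binary inputs reaches a threshold, and changing a weight changes the Boolean function it computes. The following is a minimal behavioural sketch of that idea only; the function names and the specific weight and threshold values are hypothetical and are not taken from the paper or its circuit model.

```python
from itertools import product


def threshold_gate(weights, threshold, inputs):
    """Return 1 if the weighted sum of the binary inputs reaches the threshold."""
    if len(weights) != len(inputs):
        raise ValueError("one weight per input is required")
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


def truth_table(weights, threshold):
    """Enumerate the gate output for every binary input combination."""
    n = len(weights)
    return {bits: threshold_gate(weights, threshold, bits)
            for bits in product((0, 1), repeat=n)}


# Large fan-in AND and OR share unit weights and differ only in the threshold.
and4 = truth_table([1, 1, 1, 1], 4)
or4 = truth_table([1, 1, 1, 1], 1)

# Reconfiguration by changing one weight (illustrative values):
# weights (2, 1, 1) with threshold 2 realise x0 OR (x1 AND x2);
# lowering the first weight to 1 turns the gate into a 2-of-3 majority.
f_before = truth_table([2, 1, 1], 2)
f_after = truth_table([1, 1, 1], 2)
```

Both functions in the reconfiguration example are linearly separable, which is the class of functions a single threshold gate can realise.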

ENABLING AREA EFFICIENT RF ICS THROUGH MONOLITHIC 3D INTEGRATION

Speaker:
Panagiotis Chaourani, KTH, Royal Institute of Technology, Stockholm, SE
Authors:
Panagiotis Chaourani, Per-Erik Hellström, Saul Rodriguez, Raul Onet and Ana Rusu, KTH, Royal Institute of Technology, SE

Abstract
The Monolithic 3D (M3D) integration technology has emerged as a promising alternative to dimensional scaling thanks to the unprecedented integration density capabilities and the low interconnect parasitics that it offers. In order to support technological investigations and enable future M3D circuits, M3D design methodologies, flows and tools are essential. Prospective M3D digital applications have attracted a lot of scientific interest. This paper identifies the potential of M3D RF/analog circuits and presents the first attempt to demonstrate such circuits. To this end, an M3D custom design platform, which is fully compatible with commercial design tools, is proposed and validated. The design platform includes process characteristics, device models, LVS and DRC rules and a parasitic extraction flow. The envisioned M3D structure is built on a commercial CMOS process that serves as the bottom tier, whereas an SOI process is used as the top tier. To validate the proposed design flow and to investigate the potential of M3D RF/analog circuits, an RF front-end design for ZigBee WPAN applications is used as a case study. The M3D RF front-end circuit achieves a 35.5% area reduction while showing performance similar to that of the original 2D circuit.

Download Paper (PDF; Only available from the DATE venue WiFi)
MAPPING GRANULARITY ADAPTIVE FTL BASED ON FLASH PAGE RE-PROGRAMMING
Speaker:
Yazhi Feng, Wuhan National Lab for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, CN
Authors:
Yazhi Feng, Dan Feng, Chenye Yu, Wei Tong and Jingjing Liu, Wuhan National Lab for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China, CN
Abstract
The page size of NAND flash continuously grows as the manufacturing process advances. While a larger page can reduce the cost per bit and improve the throughput of NAND flash, it may waste storage space and data transfer time. It also causes more frequent garbage collections when serving small write requests. To address these issues, we propose a Mapping Granularity Adaptive FTL (MGA-FTL) based on the flash page re-programming feature. MGA-FTL enables finer-granularity NAND flash space management and exploits multiple subpage writes on a single flash page without erase. A 2-level mapping is introduced to serve requests of different sizes in order to control the DRAM overhead. Meanwhile, the allocation strategy determines whether different logical pages can be mapped to a single physical page to balance space utilization and performance. Subpage merging limits the number of physical pages associated with a logical page, which reduces data fragmentation and improves the performance of read operations. We compared MGA-FTL with some typical FTLs, including page-level mapping FTL and sector-log mapping FTL. Experimental results show that MGA-FTL reduces the I/O response time, write amplification and the number of erasures by 53%, 30% and 40%, respectively. Despite the overhead of fine-grained management, MGA-FTL increases the DRAM requirement by no more than 16.5% compared with a page-level mapping FTL. Compared with subpage-level mapping, MGA-FTL needs only one third of the DRAM space for storing mapping tables.
Download Paper (PDF; Only available from the DATE venue WiFi)
<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>18:30</td>
<td>IP2-5</td>
<td>I-BEP: A NON-REDUNDANT AND HIGH-CONCURRENCY MEMORY PERSISTENCY MODEL</td>
<td>Yuanchao Xu, Capital Normal University, CN</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Authors: Yuanchao Xu, Zeyi Hou, Junfeng Yan, Lu Yang and Hu Wan, Capital Normal University, CN</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Abstract: Byte-addressable, non-volatile memory (NVM) technologies enable fast persistent updates but risk data inconsistency upon a failure. Recent proposals present several persistency models to guarantee data consistency. However, they fail to express the minimal persist ordering because they induce unnecessary ordering constraints. In this paper, we propose i-BEP, a non-redundant, high-concurrency memory persistency model, which expresses epoch dependency via a persist directed acyclic graph instead of program order. Additionally, we propose two techniques, background persist and deferred eviction, to enhance the performance of i-BEP. We demonstrate that i-BEP improves performance by 15% on average for typical data structures over the buffered epoch persistency (BEP) model.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
<tr>
<td>18:31</td>
<td>IP2-6</td>
<td>SPMS: STRAND BASED PERSISTENT MEMORY SYSTEM</td>
<td>Shuo Li, National University of Defense Technology, CN</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Authors: Shuo Li¹, Peng Wang², Nong Xiao¹, Guangyu Sun² and Fang Liu¹</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>¹National University of Defense Technology, CN; ²Peking University, CN</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Abstract: Emerging non-volatile memories enable persistent memory, which offers the opportunity to directly access persistent data structures residing in main memory. In order to keep persistent data consistent in case of system failures, most prior work relies on persist ordering constraints, which incur significant overheads. Strand persistency minimizes persist ordering constraints. However, there is still no proposed persistent memory design based on strand persistency due to its implementation complexity. In this work, we propose a novel persistent memory system based on strand persistency, called SPMS. SPMS consists of cacheline-based strand group tracking components, a volatile strand buffer and ultra-capacitors incorporated in persistent memory modules. SPMS can track each strand and guarantee its atomicity. In case of system failures, committed strands buffered in the strand buffer can be flushed back to persistent memory within the residual energy window provided by the ultra-capacitors. Our evaluations show that SPMS outperforms the state-of-the-art persistent memory system by 6.6% and has slightly better performance than the baseline without any consistency guarantee. Moreover, SPMS reduces the persistent memory write traffic by 30% with the help of the strand buffer.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
<tr>
<td>18:32</td>
<td>IP2-7</td>
<td>ARCHITECTING HIGH-SPEED COMMAND SCHEDULERS FOR OPEN-ROW REAL-TIME SDRAM CONTROLLERS</td>
<td>Leonardo Ecco, TU Braunschweig, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Authors: Leonardo Ecco¹ and Rolf Ernst²</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>¹Institute of Computer and Network Engineering, TU Braunschweig, DE; ²TU Braunschweig, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Abstract: As SDRAM modules get faster and their data buses wider, researchers have proposed the use of the open-row policy in command schedulers for real-time SDRAM controllers. While the real-time properties of such schedulers have been thoroughly investigated, their hardware implementation has not. Hence, in this paper, we propose a highly parallel, multi-stage architecture that implements a state-of-the-art open-row real-time command scheduler. Moreover, we evaluate this architecture in terms of hardware overhead and performance.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
</tbody>
</table>

4.4 From functional validation to functional qualification

**Date:** Tuesday 28 March 2017

**Time:** 17:00 - 18:30

**Location / Room:** 3A

**Chair:** Graziano Pravadelli, University of Verona, IT

**Co-Chair:** Elena Ioana Vatajelu, TIMA, FR

This session presents techniques and tools to generate test cases for functional validation and to define coverage metrics for functional qualification.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>17:00</td>
<td>4.4.1</td>
<td>DATA FLOW TESTING FOR VIRTUAL PROTOTYPES</td>
<td>Muhammad Hassan, University of Bremen, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Authors: Muhammad Hassan¹, Vladimir Herdt¹, Hoang M. Le¹, MingSong Chen², Daniel Grosse³ and Rolf Drechsler³</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>¹University of Bremen, DE; ²East China Normal University, CN; ³University of Bremen/DFKI GmbH, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Abstract: Data flow testing (DFT) has been shown to be an effective testing strategy. DFT features a high fault detection rate while avoiding the severe scalability problems of achieving full path coverage. In this paper we propose to apply data flow testing to SystemC virtual prototypes (VPs). Our contribution is twofold: First, we develop a set of SystemC-specific coverage criteria for data flow testing. This requires considering the SystemC semantics of non-preemptive thread scheduling with shared memory communication and event-based synchronization. Second, we explain how to automatically compute the data flow coverage result for a given VP using a combination of static and dynamic analysis techniques. The coverage result provides clear suggestions for the testing engineer to add new test cases in order to improve coverage. Our experimental results on real-world VPs demonstrate the applicability and efficacy of our analysis approach and of the SystemC-specific coverage criteria in improving the test suite.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
</tbody>
</table>
4.5 Hot Topic Session: On How to Design and Manage Exascale Computing System Technologies

Date: Tuesday 28 March 2017
Time: 17:00 - 18:30
Location / Room: 3C

Organiser:
Donatella Sciuto, Politecnico di Milano, IT

Chair:
Donatella Sciuto, Politecnico di Milano, IT

Co-Chair:
José L. Ayala, Universidad Complutense de Madrid, ES

The growing race towards exascale computing is pushing the adoption of ever more heterogeneous systems into the mainstream. The resources available on a chip, the level of integration and the speed of components have increased dramatically over the years. Moreover, to handle the stringent performance requirements of future exascale-class applications, High Performance Computing (HPC) systems need ultra-efficient heterogeneous compute nodes. However, we keep on adopting superseded approaches to the exploitation of these resources. In this session, the speakers will focus on these requirements, providing insight on how to enable the definition and the efficient deployment of such a technology.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>17:00</td>
<td>4.5.1</td>
<td>TOWARDS EXASCALE COMPUTING WITH HETEROGENEOUS ARCHITECTURES</td>
<td>Kenneth O’Brien, Xilinx Inc., IE; Lorenzo Di Tucci, Gianluca Durelli, and Michaela Blott, Politecnico di Milano, IT</td>
</tr>
</tbody>
</table>

**Abstract**

The goal of reaching exascale computing is made especially challenging by the highly heterogeneous nature of modern platforms and the energy they consume. Compute nodes typically use multiple multi-core CPUs and are increasingly equipped with PCIe-based accelerators, both of which contribute to an ever more dynamic power consumption. In our study we evaluate our target application on a variety of heterogeneous platforms, including high-end FPGA, GPU, and Xeon Phi accelerators, with respect to energy efficiency at the node and cluster level. We compare multiple implementations of our application, each built with a different modern parallel programming framework, with respect to execution performance, code complexity and energy efficiency. We then extrapolate from our findings the implications of scaling this application towards exascale, with projections of the computation achievable within the exascale power budget for our three architectures.

Download Paper (PDF; Only available from the DATE venue WiFi)

| 17:18 | 4.5.2 | FROM EXAFLOP TO EXAFLOW | Tobias Becker, Maxeler Technologies, GB |

**Abstract**

Exascale computing is facing a gap between the ever increasing demand for application performance and the underlying chip technology, which no longer delivers the expected exponential increase in CPU performance. The industry is now progressively moving towards dedicated accelerators to deliver high performance and better energy efficiency. However, the question of programmability still remains. To address this challenge we propose a dedicated high-level accelerator programming and execution model where performance and efficiency are primary targets. Our model splits the computation into a conventional CPU-oriented part and a highly efficient, fully programmable dataflow part. We present a number of systematic transformations and optimisations targeting Maxeler dataflow systems that typically yield one to two orders of magnitude improvements in terms of both performance and energy efficiency. These significant gains are enabled by addressing fundamental algorithmic properties and on-demand numerical requirements. This approach is demonstrated by a case study from computational finance.

Download Paper (PDF; Only available from the DATE venue WiFi)

| 17:36 | 4.5.3 | HETEROGENEOUS EXASCALE SUPERCOMPUTING: THE ROLE OF CAD IN THE EXAFPGA PROJECT | Marco Santambrogio, Politecnico di Milano, IT; Pavel Burovskiy, Anna Maria Nestorov, Hristina Palikareva, Enrico Reggiani, and Georgi Gaydadjiev, Maxeler Technologies, GB; Maxeler Technologies Ltd, GB; Politecnico di Milano, IT; Maxeler / Imperial College, GB |

**Abstract**

Since the end of Moore’s law is limiting the growth of general purpose processors, High Performance Computing (HPC) systems are considering FPGA-based accelerators as a promising solution for several application fields. However, their employment poses challenges that research is still tackling, and existing tools and workflows do not naturally adapt to the scale and complexity of HPC domains. To help researchers and practitioners, this paper proposes CAOS, a platform that implements an FPGA development workflow tailored to HPC systems while being open to external contributions. Researchers and developers can plug into CAOS to experiment with and compare their solutions at each step of the design flow. This paper describes the CAOS workflow and validates it against several case studies to assess its generality and highlight possible research contributions.

Download Paper (PDF; Only available from the DATE venue WiFi)

| 17:54 | 4.5.4 | AN OPEN RECONFIGURABLE RESEARCH PLATFORM AS STEPPING STONE TO EXASCALE HIGH-PERFORMANCE COMPUTING | Dirk Stroobandt, Ghent University, BE; Catalin Bogdan Ciobanu, Marco D. Santambrogio, Jose Gabriel Coutinho, Andreas Brokalakis, Dionisios Pnevmatikatos, Michael Huebner, Tobias Becker, and Alex J. W. Thom, Ghent University, BE; Ruva, NL; Politecnico di Milano, IT; Imperial College London, GB; Synelixis, GR; ECE Department, Technical University of Crete & FORTH-ICS, GR; Ruhr-University Bochum, DE; Maxeler Technologies, GB; University of Cambridge, GB |

**Abstract**

To handle the stringent performance and power requirements of future exascale-class applications, High Performance Computing (HPC) systems need ultra-efficient heterogeneous compute nodes and hardware accelerators with a high degree of specialization. Ideally, dynamic reconfiguration will be an intrinsic feature, so that specific HPC application features can be optimally accelerated, even if they regularly change over time. We create a new and flexible exploration platform for developing reconfigurable architectures, design tools and HPC applications with run-time reconfiguration built-in as a core fundamental feature instead of an add-on. Our project proposes an open research platform that covers the entire stack from architecture up to the application, focusing on the fundamental building blocks for run-time reconfigurable exascale HPC systems: new chip architectures with very low reconfiguration overhead, new tools that truly take reconfiguration as a central design concept, and applications that are tuned to maximally benefit from the proposed run-time reconfiguration techniques. Ultimately, this open platform will enable groundbreaking research towards new exascale computing platforms.

Download Paper (PDF; Only available from the DATE venue WiFi)

| 18:12 | 4.5.5 | GEOPM: A VEHICLE FOR EXASCALE COMMUNITY COLLABORATION TOWARD CO-DESIGNED ENERGY MANAGEMENT SOLUTIONS | Matthias Maiterth, Intel, US; Jonathan Eastep, Intel, US |

**Abstract**

The power scaling challenge associated with Exascale systems is a well-known issue. In this invited talk, we provide an overview of the Global Extensible Open Power Manager (GEOPM). GEOPM is an open source power management runtime framework which is being contributed to the HPC community to foster collaboration on new power management runtime techniques to address Exascale power challenges or enhance performance and power efficiency on today’s systems as well. Through GEOPM’s plug-in extensible architecture, it enables rapid prototyping of new runtime algorithms. This talk will cover GEOPM’s architecture, interfaces, and project status. For additional information, please visit: https://geopm.github.io/geopm/

| 18:30 | End of session |

**Exhibition Reception** in Exhibition Area

The Exhibition Reception will take place on Tuesday in the exhibition area, where free drinks for all conference delegates and exhibition visitors will be offered. All exhibitors are welcome to also provide drinks and snacks for the attendees.

4.6 Fault modeling, test generation and diagnosis

This session includes a presentation about new SAT-based ATPG techniques for robust initialization of transistor stuck-open faults. Further, a diagnosis method for arbiter physical unclonable functions to identify systematic manufacturing issues is presented. The last paper analyzes failure modes of Flash memories and proposes suitable fault models.

### 17:00 4.6.1 (Best Paper Award Candidate)

**FAST AND WAVEFORM-ACCURATE HAZARD-AWARE SAT-BASED TSOF ATPG**

**Speaker:**
Jan Burchard, University of Freiburg, DE

**Authors:**
Jan Burchard1, Dominik Erb1, Adit D. Singh2, Sudhakar M. Reddy3 and Bernd Becker1

1University of Freiburg, DE; 2Auburn University, US; 3University of Iowa, US

**Abstract**

Opens are known to be one of the predominant defects in nanoscale technologies. Especially with an increasing number of complex cells in today's VLSI designs, intra-gate opens are becoming a major problem. The generation of tests for these faults is hard, as the timing of the circuit needs to be considered accurately to prevent the invalidation of the generated tests through hazards. Current test generation methods, including new cell-aware tests that explicitly target open defects, ignore the possibility of hazard-caused test invalidation. Such tests can fail to detect a significant fraction of the targeted opens. In this work we present a waveform-accurate hazard-aware test generation approach to target intra-gate opens. Our methodology is based on a SAT-based encoding and allows the generation of tests guaranteed to be robust against hazards. Experimental results for large benchmarks mapped to the state-of-the-art NanGate 45nm cell library including complex cells show the test generation efficiency of the proposed method. Large circuits were handled efficiently, even without the use of fault simulation. Our experiments show that, on average, about 10.92% of conventional hazard-unaware tests will fail to detect the targeted opens because of test invalidation; these opens are reliably detected by our new test generation methodology. Importantly, our approach can also be applied to improve the effectiveness of commercial cell-aware tests.

Download Paper (PDF; Only available from the DATE venue WiFi)

### 17:30 4.6.2

**FAULT DIAGNOSIS OF ARBITER PHYSICAL UNCLONABLE FUNCTION**

**Speaker:**
Yu Hu, Institute of Computing Technology, Chinese Academy of Sciences, CN

**Authors:**
Jing Ye, Qingsi Guo, Yu Hu and Xiaowei Li

1State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, CN; 2State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, CN

**Abstract**

Physical Unclonable Functions (PUFs) have broad application prospects in the field of hardware security. If faults occur in a PUF during manufacturing, the security of the whole chip is threatened. Fault diagnosis plays an important role in the yield learning process. However, since different manufactured PUFs with the same design have different Challenge-Response Pairs (CRPs), which cannot be predicted, the traditional fault diagnosis method based on comparing the fault-free responses of a design with the failing responses of chips is no longer suitable for diagnosing PUFs. Therefore, this paper proposes a fault diagnosis method for the classic arbiter PUF. Stuck-at faults and delay faults are considered. Based on the expected uniformity of the arbiter PUF, a diagnostic challenge generation method and a corresponding CRP analysis method are proposed to distinguish faults within the arbiter PUF. Experimental results show that the diagnostic accuracy reaches 100.0% with good diagnostic resolution.
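The uniformity that the abstract relies on is commonly defined as the fraction of 1s among the response bits of one PUF instance, with an ideal value of 0.5. The sketch below only illustrates that metric and a naive deviation check; it is not the diagnosis method of the paper, and the function names and the tolerance value are assumptions made for this example.

```python
def uniformity(responses):
    """Fraction of 1s among the response bits of one PUF instance (ideal: 0.5)."""
    if not responses:
        raise ValueError("need at least one response bit")
    return sum(responses) / len(responses)


def deviates_from_ideal(responses, tolerance=0.1):
    """Flag a PUF whose uniformity is far from 0.5 (hypothetical tolerance)."""
    return abs(uniformity(responses) - 0.5) > tolerance
```

For instance, a response stream that is constantly 1, as a stuck-at fault at the arbiter output might produce, has uniformity 1.0 and is flagged, whereas a balanced stream is not.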

Download Paper (PDF; Only available from the DATE venue WiFi)

### 18:00 4.6.3

**FPGA-BASED FAILURE MODE TESTING AND ANALYSIS FOR MLC NAND FLASH MEMORY**

**Speaker:**
Fei Wu, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, CN

**Authors:**
Meng Zhang, Fei Wu, Qian Xia, He Huang, Jian Zhou and Changsheng Xie

1Huazhong University of Science and Technology, CN; 2University of Central Florida, US

**Abstract**

As flash memory storage density improves, data reliability and flash lifetime decrease. Error correction codes (ECC) and error management schemes can boost both reliability and lifetime. However, in order to develop effective fault tolerance algorithms and management solutions, it is necessary to have a more profound understanding of the failure modes of flash memory. To enable such understanding, we design an experimental platform and scheme to investigate flash failure modes. This paper examines various failure modes occurring in 2x-nm MLC NAND flash technologies, such as page allocation scheme-based program interference (PASSPI) errors (i.e., different page allocation schemes mean data can be programmed into flash pages in different ways, which can lead to different program interference errors), write errors of the least significant bit (LSB) and the most significant bit (MSB), and data pattern-based read interference errors (i.e., different data values programmed into flash pages can cause different read interference errors). We analyze these observed failure modes and explain why they exist. We hope that understanding these failure modes will help in proposing effective fault tolerance and error management algorithms.

Download Paper (PDF; Only available from the DATE venue WiFi)

### 18:30 4.6.4

**IP2-10, 342**

**RETRODMR: TROUBLESHOOTING NON-DETERMINISTIC FAULTS WITH RETROSPECTIVE DMR**

**Speaker:**
Ting Wang, The Chinese University of Hong Kong, HK

**Authors:**
Ting Wang, Yannan Liu, Qian Xu, Zhaobo Zhang, Zhiyuan Wang and Xinli Gu

1The Chinese University of Hong Kong, HK; 2Huawei Technologies, Inc., US

**Abstract**

The most notorious faults for diagnosis in post-silicon validation are those that manifest themselves in a non-deterministic manner with system-level functional tests, where errors randomly appear from time to time even when applying the same workloads. In this work, we propose a novel diagnostic framework that resorts to dual-modular redundancy (DMR) for troubleshooting non-deterministic faults, namely RetroDMR. To be specific, we log the essential events (e.g., the sequence of thread migration) in the faulty run to record the mapping relationship between threads and their corresponding execution units. Then in the following diagnosis runs, we apply redundant multithreading (RMT) technique to reduce error detection latency, while at the same time we try to follow the thread migration sequence of the original run whenever possible. By doing so, RetroDMR significantly improves the reproduction rate and diagnosis resolution for non-deterministic faults, as demonstrated in our experimental results.

Download Paper (PDF; Only available from the DATE venue WiFi)
This paper introduces an on-line heuristic to maximize soft-error reliability while satisfying a lifetime reliability constraint for soft real-time systems executed on heterogeneous MPSoCs consisting of high-performance cores and low-power cores. Based on the run-time frequencies and utilizations of the cores, the heuristic performs workload migration between the high-performance cores and low-power cores to achieve improved soft-error reliability. Experimental results from both a hardware platform and a simulator show that the proposed algorithm reduces the probability of faults by at least 30% compared to a number of representative existing approaches.

The session covers variability-aware solutions at the system and circuit level. First, neuromorphic circuits and their relation to process variation are addressed. After that, variability management for today’s and tomorrow’s computing is addressed again, this time for entire computing systems.

**Abstract**

In this paper, an approach for increasing the sustainability of inverter-based memristive neuromorphic circuits in the presence of process variation is presented. The approach extracts the impact of process variations on the neurons' characteristics during the test phase through a proposed algorithm. In this method, first, some combinations of inputs and weights (based on the neuromorphic circuit structure) are injected into the circuit and the features of the neurons are determined. Next, these features, which are back-annotated, are utilized in an efficient ex-situ training approach to determine the proper weights of the neurons. The approach provides a considerable improvement in the output accuracy. To evaluate the effectiveness of the proposed approach, some approximate applications are studied using 90nm technology. The results of the study reveal that using this framework provides, on average, 17X higher output accuracy compared to cases in which the impact of process variation is not considered at all.

**Download Paper (PDF; Only available from the DATE venue WiFi)**
APPLICATION PERFORMANCE IMPROVEMENT BY EXPLOITING PROCESS VARIABILITY ON FPGA DEVICES

Speaker: Konstantinos Maragos, National Technical University of Athens, GR
Authors: Konstantinos Maragos, George Lentaris, Kostas Siozios, Dimitrios Soudris and Vasilis Pavlidis
National Technical University of Athens, GR; The University of Manchester, GB

Abstract
Process variability is known to be increasing with technology scaling in IC fabrication, thereby degrading the overall performance of the manufactured devices. The current paper focuses on the variability effect in FPGAs and the possibility to boost the performance of each device at run-time, after fabrication, based on the individual characteristics of this device. First, we develop a sensing infrastructure involving a wide network of customized ring oscillators to measure intra-chip and inter-chip variability in 28nm FPGAs, i.e., in eight Xilinx Zynq XC7Z020T-1CSG324 devices. Second, we develop a closed-loop framework based on dynamic reconfiguration of clock tiles, I/O data sniffing, HW/SW communication, and verification with test vectors, to dynamically increase the operating frequency in Zynq while preserving its correctness. Our results show intra-chip variability in the range of 5.2% to 7.7% and inter-chip variability of up to 17%. Our framework improves the performance of example FIR designs by up to 90.3% compared to the SW tool reports and reveals speed differences among devices of up to 12.4%.

Download Paper (PDF; Only available from the DATE venue WiFi)

4.8 CV Fair DATE 2017

Date: Tuesday 28 March 2017
Time: 17:00 - 18:30
Location / Room: Exhibition Theatre

Organiser: Marisa Lopez-Vallejo, UPM, ES
Moderator: Marisa Lopez-Vallejo, UPM, ES

The Curriculum Vitae (also known as a vita or CV) is the first point of contact between employee and employer. It must provide a concise overview of academic background and achievements. Furthermore, it usually should catch the attention of the readers, get them to take a closer look at you and ultimately invite you for an interview. Philippe Ory, Head of the EPFL Career Center, will open this CV Fair with a talk on the key issues that must be addressed when writing a CV.

Afterwards, organizations participating in the CV Fair will give a brief presentation with basic information about the company, potential positions or internships, what types of students are being sought, etc. The CV Fair is designed to allow students to engage in individual conversations with the company or organization team and ask specific questions that may have arisen during the presentation.

UB04 Session 4

Date: Tuesday 28 March 2017
Time: 17:30 - 19:30
Location / Room: Booth 1, Exhibition Area

NOXIM-XT: A BIT-ACCURATE POWER ESTIMATION SIMULATOR FOR NOCS

Presenter: Pierre Bornel, Université de Bretagne Sud, FR
Authors: André Rossi¹, Johann Laurent² and Erwan Moreac²
¹LERIA, Université d’Angers, Angers, France, FR; ²Lab-STICC, Université de Bretagne Sud, Lorient, FR

Abstract: We have developed an enhanced version of Noxim (Noxim-XT) to estimate the energy consumption of a NoC in a SoC. Noxim-XT is used in a two-step methodology. First, applications are mapped on a SoC and their traffic is extracted by simulation with MPSoCBench. Second, Noxim-XT tests various hardware configurations of the NoC. For each configuration, the application’s traffic is re-injected and replayed, an accurate performance and power breakdown is provided, and the user can choose different data coding strategies. With the help of Noxim-XT, the energy consumption of each configuration is estimated bit-accurately. After simulation, a spatial mapping of the energy consumption is provided and highlights the hot-spots. Moreover, the new coding strategies allow significant energy savings. Noxim-XT simulations and an FPGA-based prototype of a new coding strategy will be demonstrated at the U-booth to illustrate this work.

More information ...

RIMEDIO: WHEELCHAIR MOUNTED ROBOTIC ARM DEMONSTRATOR FOR PEOPLE WITH MOTOR SKILLS IMPAIRMENTS

Presenter: Alessandro Palla, University of Pisa, IT
Authors: Gabriele Meoni and Luca Fanucchi, University of Pisa, IT

Abstract: People with reduced mobility experience many issues when interacting with indoor and outdoor environments because of their disability. For those users, even the simplest action might be hard or impossible to perform without the assistance of an external aid. We propose a simple and lightweight wheelchair-mounted robotic arm with a focus on the human-machine interface, which has to be simple and accessible for users with different kinds of disabilities. The robotic arm is equipped with a 5 MP camera, force and proximity sensors, and a 6-axis Inertial Measurement Unit on the end-effector, and can be controlled using an app running on a tablet. When the user selects the object to reach (for instance a button) on the tablet screen, the arm autonomously carries out the task, using the camera image and the sensor measurements for autonomous navigation. The demonstrator consists of the robotic arm prototype, the Android tablet and a personal computer for arm setup and configuration.

More information ...

OPENCTMOD: AN OPEN SOURCE COLLABORATIVE MATLAB TOOLBOX FOR THE DESIGN AND SIMULATION OF CONTINUOUS-TIME SIGMA DELTA MODULATORS

Presenter: Dang-Kien Germain Pham, LTCI, Télécom ParisTech, Université Paris-Saclay, 75013, Paris, France, FR
Author: Chadi Jabbour, LTCI, Télécom ParisTech, Université Paris-Saclay, FR

Abstract: Simulating Continuous Time (CT) Sigma Delta Modulators (SDM) is commonly done using block-level tools such as Simulink, which is highly time-consuming even at system level. Moreover, the existing design tools for SDM are either discrete-time oriented (Schreier toolbox) or proprietary (Ulm toolbox). In this work, we propose a new Matlab/C toolbox for the design of CT SDM. Simulation is based on a state-space representation, which allows most of the existing SDM architectures to be supported. Moreover, the main non-idealities of the main blocks are modeled (opamp DC gain, finite GBW, DAC mismatch, ISI and quantizer offset). In addition, thanks to the modular and open source approach of this toolbox, every user can easily implement additional features and include them. During the forum, designs and simulations for various CT SDM architectures will be performed to demonstrate the accuracy and efficiency of the proposed toolbox. The collaborative aspect will also be shown.

More information ...

MATISSE: A TARGET-AWARE COMPILER TO TRANSLATE MATLAB INTO C AND OPENCL

Presenter: Luís Reis, University of Porto, PT
Authors: João Bispo and João Cardoso, University of Porto / INESC-TEC, PT

Abstract: Many engineering, scientific and finance algorithms are prototyped and validated in array languages, such as MATLAB, before being converted to other languages such as C for use in production. As such, there has been substantial effort to develop compilers that perform this translation automatically. Alternative types of computation devices, such as GPGPUs and FPGAs, are becoming increasingly popular, so it becomes critical to develop compilers that target these architectures. We have adapted MATISSE, our MATLAB-compatible compiler framework, to generate C and OpenCL code for these platforms. In this demonstration, we will show how our compiler works and what its capabilities are. We will also describe the main challenges of efficient code generation from MATLAB and how to overcome them.

More information ...

A VOLTAGE-SCALABLE FULLY DIGITAL ON-CHIP MEMORY FOR ULTRA-LOW-POWER IOT PROCESSORS

Presenter: Jun Shiomi, Kyoto University, JP
Authors: Tohru Ishihara and Hidetoshi Onodera, Kyoto University, JP

Abstract: A voltage-scalable RISC processor integrating standard-cell based memory (SCM) is demonstrated. Unlike conventional processors, the processor uses SCMs as an alternative to conventional SRAM macros, enabling it to operate at a single supply voltage of 0.4 V. The processor is implemented with a fully automated cell-based design flow, which leads to low design costs. By scaling the supply voltage and applying back-gate biasing techniques, the power dissipation of the SCMs is kept below 20 uW, enabling the SCMs to operate with an ambient energy source only. In this demonstration, the SCMs of the processor operate with a lemon battery as the ambient energy source.

More information ...

GNOCs: AN ULTRA-FAST, HIGHLY EXTENSIBLE, CYCLE-ACCURATE GPU-BASED PARALLEL NETWORK-ON-CHIP SIMULATOR

Presenter: Amir CHARIF, TIMA, FR
Authors: Nacer-Eddine Zergainoh and Michael Nicolaidis, TIMA, FR

Abstract: With the continuous decrease in feature sizes and the recent emergence of 3D stacking, chips comprising thousands of nodes are becoming increasingly relevant, and state-of-the-art NoC simulators are unable to simulate such a high number of nodes in reasonable times. In this demo, we showcase GNOCs, the first detailed, modular and scalable parallel NoC simulator running fully on GPU (Graphics Processing Unit). Based on a unique design specifically tailored for GPU parallelism, GNOCs is able to achieve unprecedented speedups with no loss of accuracy. To enable quick and easy validation of novel ideas, the programming model was designed with high extensibility in mind. Currently, GNOCs accurately models a VC-based microarchitecture. It supports 2D and 3D mesh topologies with full or partial vertical connections. A variety of routing algorithms and synthetic traffic patterns, as well as dependency-driven trace-based simulation (Netrace), are implemented and will be demonstrated.

ACCELERATORS: RECONFIGURABLE SELF-TIMED DATAFLOW ACCELERATOR & FAST NETWORK ANALYSIS IN SILICON

Presenter: Alessandro de Gennaro, Newcastle University, GB
Authors: Danil Sokolov and Andrey Mokhov, Newcastle University, GB
Abstract: Many real-life applications require dynamically reconfigurable pipelines to handle incoming data items differently depending on their values or the current operating mode. A demo will show the benefits of an asynchronous accelerator for ordinal pattern encoding with reconfigurable pipeline depth. It was designed, simulated and verified using the dataflow structure formalism in the Workcraft toolset. The self-timed chip, fabricated in TSMC 90nm, shows high resilience to voltage variation and configurable accuracy of the results. Applications with underlying graph models call for a fast and flexible approach to graph analysis. To support medicine discovery, biological connections are modelled by graphs, and drugs can disconnect some of the connections. A demo will show how graphs can be automatically converted into VHDL designs, which are synthesised into an FPGA for the analysis, a thousand times faster than in software. A single stand will be used for both case studies.

SELINK: SECURING HTTP AND HTTPS-BASED COMMUNICATION VIA SECUBE™

Presenter: Airofarulla Giuseppe, CINI & Politecnico di Torino, IT
Authors: Paolo Prinetto and Antonio Varriale
¹Politecnico di Torino, IT; ²Blu5 Labs Ltd., IT
Abstract: The SEcube™ Open Source platform is a combination of three main cores in a single-chip design. A low-power ARM Cortex-M4 processor, a flexible and fast Field-Programmable Gate Array (FPGA), and an EAL5+ certified Security Controller (SmartCard) are embedded in an extremely compact package. This makes it a unique Open Source security environment where each function can be optimized, executed, and verified on its proper hardware device. In this demo, we present a client-server HTTP and HTTPS-based application whose traffic is encrypted using the built-in hardware capabilities and the software libraries of the SEcube™. By doing so, we show how communication can be secured against an attacker capable of inspecting and tampering with the regular communication.

GREENOPENHEVC: LOW POWER HEVC DECODER

Presenter: Menard Daniel, INSA Rennes, FR
Authors: Julien Heulot, Erwan Nogues, Maxime Pelcat and Wassim Hamidouche
¹INSA Rennes, IETR, UBL, FR; ²Institut Pascal, Université Clermont-Ferrand, FR
Abstract: Video on mobile devices is a must-have feature given the prominence of new services and applications that use video, such as streaming or conferencing. The new video standard HEVC is an appealing technology for service providers. Moreover, with the recent progress of SoCs, software video decoders are now a reality. The challenge is to provide a power-efficient design that meets the compelling demand for long battery life. We present here a practical set-up demonstrating that the new HEVC standard can be implemented in software on an embedded GPP multicore platform. Different techniques have been integrated to optimize energy: data-level and thread-level parallelism, and video-aware Dynamic Voltage and Frequency Scaling. To push the limits further, algorithm-level approximate computing is carried out on the in-loop filtering. Subjective tests have demonstrated that the quality degradation is almost imperceptible. A mean power of less than 1 Watt is reported for HD 1080p/24fps video decoding.

End of session

Exhibition Reception

Date: Tuesday 28 March 2017
Time: 18:30 - 19:30
Location / Room: Exhibition Area

The Exhibition Reception will take place on Tuesday in the exhibition area, where free drinks for all conference delegates and exhibition visitors will be offered. All exhibitors are welcome to also provide drinks and snacks for the attendees.

End of session

5.1 IoT Day: IoT Perspectives

Date: Wednesday 29 March 2017
Time: 08:30 - 10:00
Location / Room: 5BC

Organisers: Marilyn Wolf, Georgia Tech, US
Andreas Herkersdorf, TU Muenchen, DE

Chair: Marilyn Wolf, Georgia Tech, US
Co-Chair: Andreas Herkersdorf, TU Muenchen, DE

The DATE 2017 Special Day on IoT will be kicked off by perspective talks from academia and industry, sharing views and experience from backgrounds in large distributed sensor networks and cognitive computing. The entire spectrum of IoT devices and computing, storage and communication infrastructure, from the smallest form factor sensors to Cloud backbone systems, will be considered.

DESIGN FOR IOT

Author: Lothar Thiele, Swiss Federal Institute of Technology Zurich, CH

Abstract

If the visions and forecasts of industry come true, then we will soon be surrounded by billions of interconnected embedded devices. We will interact with them in a cyber-human symbiosis, they will observe not only us but also our environment, and they will be part of many visible and ubiquitous objects around us. We have the legitimate expectation that the individual devices as well as the overall system behave in a reliable and predictable manner. This is an indispensable requirement, as it is infeasible to constantly maintain such a large set of devices. In addition, there are many application domains where we rely on correct and fault-free system behavior. We expect trustworthy results from sensing, computation, communication and actuation because of the economic importance or even catastrophic consequences if the overall system is not working correctly, e.g., in industrial automation, distributed control of energy systems, surveillance, medical applications, or early warning scenarios in the context of building safety or environmental catastrophes. Finally, trustworthiness and reliability are mandatory for the societal acceptance of human-cyber interaction and cooperation. It will be argued that we need novel architectural concepts, an associated design process and validation strategies to satisfy the strongly conflicting requirements and associated design challenges of platforms for CPS: handling, at the same time, limited available resources, adaptive run-time behavior, and predictability. These challenges concern all components of an IoT system, e.g., computation, storage, wireless communication, energy management, harvesting, sensing and sensor interfaces, and actuation. The talk will be driven by examples from various application domains such as smart watches, zero-power systems, environmental sensing, and air pollution sensing.

THE INTERNET OF THINGS IN THE COGNITIVE ERA

Author: Alessandro Curioni, IBM Zurich Research, CH

Abstract

Over the next few years, the Internet of Things will become the biggest source of data on the planet. That's where IBM's Watson cognitive computing system comes in. Watson uses machine learning and other techniques to understand this data and turn it into insight, which can help automate tasks, enable manufacturers to design better products, innovate new services and enhance our overall quality of life. And with cognitive technologies, interactions with 'things' through natural language and voice commands will dramatically improve. This presentation will focus on how innovators in the design automation and embedded systems space can benefit from this trend and get access to IBM Watson in the cloud.

5.2 Emerging Computer Paradigms

Date: Wednesday 29 March 2017
Time: 08:30 - 10:00
Location / Room: 4BC
Chair: Jim Harkin, Ulster University, GB

This session presents recent advances in emerging computing strategies including Reversible Computing and Stochastic Computing with improvements in energy efficiency and reductions in computational complexity. An acceleration platform for the design exploration of Quantum Computers is also presented.

MAKE IT REVERSIBLE: EFFICIENT EMBEDDING OF NON-REVERSIBLE FUNCTIONS

Speaker: Alwin Zulehner, Johannes Kepler University, Linz, AT
Authors: Alwin Zulehner¹ and Robert Wille²
¹Johannes Kepler University, AT; ²Johannes Kepler University Linz, AT

Abstract

Reversible computation has become established as a promising concept due to its applications in areas such as quantum computation and energy-aware circuits. Unfortunately, most functions of interest are non-reversible. Therefore, a process called embedding has to be conducted to transform a non-reversible function into a reversible one - a coNP-hard problem. Existing solutions suffer from the resulting exponential complexity and, hence, are limited to rather small functions only. In this work, an approach is presented which tackles the problem in an entirely new fashion. We divide the embedding process into matrix operations, which can be conducted efficiently on a certain kind of decision diagram. Experiments show that improvements of several orders of magnitude can be achieved using the proposed method. Moreover, for many benchmarks exact results can be obtained for the first time ever.
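
To make the notion of embedding concrete, the following minimal Python sketch shows the textbook construction that keeps the inputs and XORs the function value onto additional lines, mapping (x, y) to (x, y XOR f(x)). It is an illustrative assumption, not the decision-diagram method of the paper, and it generally uses more lines than a minimal embedding would. The helper names `embed`, `is_reversible` and `and_gate` are hypothetical.

```python
from itertools import product


def embed(f, n_in, n_out):
    """Build the truth table of the reversible map (x, y) -> (x, y XOR f(x)).

    f takes a tuple of n_in bits and returns a tuple of n_out bits.
    This is the generic textbook embedding, not an optimized one.
    """
    table = {}
    for x in product((0, 1), repeat=n_in):
        fx = f(x)
        for y in product((0, 1), repeat=n_out):
            table[x + y] = x + tuple(a ^ b for a, b in zip(y, fx))
    return table


def is_reversible(table):
    """A function on bit vectors is reversible iff it is a bijection."""
    return len(set(table.values())) == len(table)


def and_gate(x):
    """Non-reversible example: 2-input AND with a single output bit."""
    return (x[0] & x[1],)


# AND alone maps three different inputs to 0, so it cannot be inverted.
plain_and = {x: and_gate(x) for x in product((0, 1), repeat=2)}

# Embedding AND on three lines yields the Toffoli gate.
toffoli = embed(and_gate, n_in=2, n_out=1)
```

With the extra line initialised to 0, the third output carries the AND result; the first two outputs repeat the inputs and act as garbage lines.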

Download Paper (PDF; Only available from the DATE venue WiFi)

Abstract

Quantum computing is rapidly evolving, especially after the discovery of several efficient quantum algorithms that solve classically intractable problems, such as Shor's factoring algorithm. However, the realization of a large-scale physical quantum computer is very challenging, and the number of qubits in the devices currently under development is still very low, namely less than 15. In the absence of large-size platforms, quantum computer simulation is critical for developing and testing quantum algorithms and for investigating the different challenges facing the design of quantum computer hardware. What makes quantum computer simulation on classical computers particularly challenging are the memory and computational resource requirements. In this paper, we introduce a universal quantum computer simulator, called QX, that takes as input a specially designed quantum assembly language, called QASM, and provides, through aggressive optimisations, high simulation speeds and a large number of qubits. QX allows the simulation of up to 34 fully entangled qubits on a single node using less than 270 GB of memory. Our experiments using different quantum algorithms show that QX achieves a significant simulation speedup over similar state-of-the-art simulation environments.
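
The memory figure can be sanity-checked with a short back-of-the-envelope calculation. The sketch below assumes a full state-vector simulation with one double-precision complex amplitude (16 bytes) per basis state; the abstract does not state how QX stores the state, so both the storage model and the function names are assumptions for illustration.

```python
BYTES_PER_AMPLITUDE = 16  # assumption: two 64-bit floats per complex amplitude


def state_vector_bytes(n_qubits, bytes_per_amplitude=BYTES_PER_AMPLITUDE):
    """Memory needed to hold all 2**n amplitudes of an n-qubit state."""
    return (2 ** n_qubits) * bytes_per_amplitude


def to_gib(n_bytes):
    """Convert bytes to binary gigabytes (GiB)."""
    return n_bytes / 2 ** 30


if __name__ == "__main__":
    for n in (15, 30, 34, 35):
        print(f"{n} qubits: {to_gib(state_vector_bytes(n)):.6g} GiB")
```

Under these assumptions 34 qubits need 2^38 bytes, i.e. 256 GiB, which is in line with the quoted figure if "GB" is read as binary gigabytes. Every additional qubit doubles the requirement, which is why memory is the limiting resource.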

Download Paper (PDF; Only available from the DATE venue WiFi)

ENERGY EFFICIENT STOCHASTIC COMPUTING WITH SOBOL SEQUENCES

Siting Liu, University of Alberta, CA

Abstract

Energy efficiency presents a significant challenge for stochastic computing (SC) due to the long random binary bit streams required for accurate computation. In this paper, a type of low-discrepancy (LD) sequence, the Sobol sequence, is considered for energy-efficient implementations of SC circuits. The use of Sobol sequences improves the output accuracy of a stochastic circuit with a reduced sequence length compared to the use of another type of LD sequence, the Halton sequence, and conventional LFSR-generated pseudorandom sequences. The use of Sobol sequences leads to a similar or higher accuracy than using other types of LD sequences.
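
As a rough illustration of why low-discrepancy sequences help, the sketch below multiplies two unipolar stochastic numbers with an AND gate, generating the two bit streams from the first two Sobol dimensions. It is a software model written for this programme note, not the circuit from the paper; the 16-bit resolution, the function names and the natural (non-Gray-code) point ordering are assumptions.

```python
BITS = 16  # assumed fixed-point resolution of the sequence values


def direction_numbers(dim, bits=BITS):
    """Direction integers for Sobol dimension 1 or 2, scaled to `bits` bits."""
    m = [1]
    for _ in range(bits - 1):
        # Dimension 1 uses m_i = 1 (van der Corput sequence).
        # Dimension 2 uses the polynomial x + 1: m_i = 2*m_{i-1} XOR m_{i-1}.
        m.append(1 if dim == 1 else (2 * m[-1]) ^ m[-1])
    return [m_i << (bits - i) for i, m_i in enumerate(m, start=1)]


def sobol_point(n, directions):
    """n-th sequence value as an integer in [0, 2**BITS)."""
    x, i = 0, 0
    while n:
        if n & 1:
            x ^= directions[i]
        n >>= 1
        i += 1
    return x


def sc_multiply(p_a, p_b, length):
    """Estimate p_a * p_b by ANDing two comparator-generated bit streams."""
    d1, d2 = direction_numbers(1), direction_numbers(2)
    scale = 1 << BITS
    ones = 0
    for n in range(length):
        bit_a = sobol_point(n, d1) < p_a * scale  # stream encoding p_a
        bit_b = sobol_point(n, d2) < p_b * scale  # stream encoding p_b
        ones += bit_a and bit_b                   # AND gate = multiplication
    return ones / length


if __name__ == "__main__":
    for length in (16, 64, 256):
        print(length, sc_multiply(0.3, 0.7, length))
```

Using a different Sobol dimension for each input keeps the two streams uncorrelated, which is what the AND-gate multiplier requires.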

Download Paper (PDF; Only available from the DATE venue WiFi)

DESIGN AUTOMATION AND DESIGN SPACE EXPLORATION FOR QUANTUM COMPUTERS

Mathias Soeken, EPFL, CH

Abstract

A major hurdle to the deployment of quantum linear systems algorithms and recent quantum simulation algorithms lies in the difficulty of finding inexpensive reversible circuits for arithmetic using existing hand-coded methods. Motivated by recent advances in reversible logic synthesis, we synthesize arithmetic circuits using classical design automation flows and tools. The combination of classical and reversible logic synthesis enables the automatic design of large components in reversible logic starting from well-known hardware description languages such as Verilog. As a prototype example for our approach, we automatically generate high-quality networks for the reciprocal 1/x, which is necessary for quantum linear systems algorithms.

Download Paper (PDF; Only available from the DATE venue WiFi)

LOGIC ANALYSIS AND VERIFICATION OF N-INPUT GENETIC LOGIC CIRCUITS

Hasan Baig, Technical University of Denmark, DK

Abstract

Nature is using genetic logic circuits to regulate the fundamental processes of life. These genetic logic circuits are triggered by a combination of external signals, such as chemicals, proteins, light and temperature, to emit signals to control other gene expressions or metabolic pathways accordingly. As compared to electronic circuits, genetic circuits exhibit stochastic behavior and do not always behave as intended. Therefore, there is a growing interest in being able to analyze and verify the logical behavior of a genetic circuit model, prior to its physical implementation in a laboratory. In this paper, we present an approach to analyze and verify the Boolean logic of a genetic circuit from the data obtained through stochastic analog circuit simulations. The usefulness of this analysis is demonstrated through different case studies illustrating how our approach can be used to verify the expected behavior of an n-input genetic logic circuit.

Download Paper (PDF; Only available from the DATE venue WiFi)

End of session

Coffee Break in Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Tuesday, March 28, 2017

- Coffee Break 10:30 - 11:30
- Coffee Break 16:00 - 17:00

Wednesday, March 29, 2017

- Coffee Break 10:00 - 11:00
- Coffee Break 16:00 - 17:00

Thursday, March 30, 2017

- Coffee Break 10:00 - 11:00
- Coffee Break 15:30 - 16:00

5.3 Hot Topic Session: I'm Gonna Make an Approximation IoT Can't Refuse - Approximate Computing for Improving Power Efficiency of IoT and HPC

Date: Wednesday 29 March 2017
Time: 08:30 - 10:00
Location / Room: 2BC
Organiser: Vincent Camus, EPFL, CH
Power efficiency is the primary concern of IoT-related applications, both at the sensor node and on its cloud-computing counterpart. Unfortunately, achieving high efficiency and robustness requires complex and conflicting design constraints. Fortunately, the inherent error resiliency of many IoT applications allows the use of Approximate Computing techniques at both hardware and software levels, leading to great benefits on power efficiency while having a minimal impact on the applications.

Time | Label | Presentation Title | Authors
---|---|---|---
08:30 | 5.3.1 | INTRODUCTION | Christian Enz, EPFL, CH
08:45 | 5.3.2 | PUSHING THE LIMITS OF VOLTAGE OVER-SCALING FOR ERROR-RESILIENT APPLICATIONS | Olivier Sentieys, INRIA, FR
                   Rengerajan Ragavan, Benjamin Barroso, Cedric Killian and Olivier Sentieys
                   
                   INRIA, FR; University of Rennes - INRIA, FR

**Abstract**
Voltage scaling has been used as a prominent technique to improve energy efficiency in digital systems, as a reduction in the supply voltage results in a quadratic reduction in the energy consumption of the system. This energy efficiency is achieved at the cost of timing errors in the system, which are corrected through additional error detection and correction circuits. In this paper, we propose voltage over-scaling based approximate operators for applications that can tolerate errors. We characterize the basic arithmetic operators using different operating triads (combinations of supply voltage, back-biasing scheme and clock frequency) to generate models for approximate operators. Error-resilient applications can be mapped with the generated approximate operator models to achieve an optimum trade-off between energy and error margin. Based on the dynamic speculation technique, the best possible operating triad is chosen at runtime according to the user-definable error tolerance margin of the application. In our experiments in 28nm FDSoI, we achieve a maximum energy efficiency of 89% for basic operators like 8-bit and 16-bit adders at the cost of a 20% Bit Error Rate (ratio of faulty bits over total bits) by operating them in the near-threshold regime.
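
The two quantities the abstract relies on can be written down in a few lines. The sketch below assumes the usual first-order model in which dynamic energy scales with the square of the supply voltage (constant switched capacitance, leakage ignored) and uses the Bit Error Rate definition given in the abstract. The function names and the example voltages are illustrative assumptions, not values from the paper.

```python
def dynamic_energy_ratio(v_scaled, v_nominal):
    """Relative dynamic energy after voltage scaling, assuming E ~ C * V**2."""
    return (v_scaled / v_nominal) ** 2


def bit_error_rate(observed, expected, width):
    """Ratio of faulty bits over total bits for lists of `width`-bit results."""
    mask = (1 << width) - 1
    faulty = sum(bin((o ^ e) & mask).count("1")
                 for o, e in zip(observed, expected))
    return faulty / (len(expected) * width)


if __name__ == "__main__":
    # Halving the supply voltage quarters the dynamic energy in this model.
    print(dynamic_energy_ratio(0.5, 1.0))
    # One wrong bit in a single 8-bit result gives a BER of 1/8.
    print(bit_error_rate([0b1010], [0b1000], width=8))
```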

Download Paper (PDF; Only available from the DATE venue WiFi)

09:00 | 5.3.3 | COMBINING STRUCTURAL AND TIMING ERRORS IN OVERCLOCKED INEXACT SPECULATIVE ADDERS | Vincent Camus, EPFL, CH
                   Xun Jiao, Vincent Camus, Mattia Cacciotti, Yu Jiang, Christian Enz and Rajesh Gupta
                   
                   UC San Diego, US; EPFL, CH; Tsinghua University, CN

**Abstract**
Worst-case design is used in IoT devices and high-performance data centers to ensure reliability by adding extra safety margin, leading to a loss in power efficiency. Recently, approximate computing has been proposed to trade off accuracy for efficiency. In this paper, we use an inexact speculative adder, which redesigns the adder architecture by shortening the critical path to reduce power consumption. Its overdesign introduces structural errors due to carry speculation. On the other hand, overclocking is used to reduce conservative timing guardbands but could introduce timing errors. In this paper, we apply a supervised learning model to overclocked inexact speculative adders to predict timing errors at bit level. We analyze these two types of errors and examine their joint effects.
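
The origin of structural errors can be seen in a small behavioural model of carry speculation. The sketch below is a generic illustration, not the adder architecture evaluated in the paper: it assumes the adder is split into fixed-size blocks and that each block guesses its carry-in from the preceding block only, which shortens the carry chain. The function name, the 16-bit width and the 4-bit block size are hypothetical choices.

```python
def speculative_add(a, b, width=16, block=4):
    """Add two `width`-bit numbers with block-wise carry speculation.

    Each block takes its carry-in from the preceding block alone and ignores
    carries that would ripple in from further down the chain.
    """
    mask = (1 << block) - 1
    result = 0
    for lo in range(0, width, block):
        a_blk = (a >> lo) & mask
        b_blk = (b >> lo) & mask
        if lo == 0:
            carry_in = 0
        else:
            prev_a = (a >> (lo - block)) & mask
            prev_b = (b >> (lo - block)) & mask
            carry_in = (prev_a + prev_b) >> block  # speculated carry
        result |= ((a_blk + b_blk + carry_in) & mask) << lo
    return result


if __name__ == "__main__":
    # No long carry chain: the speculative result is exact.
    print(hex(speculative_add(0x1234, 0x1111)))
    # A carry has to ripple across two blocks: the speculation misses it.
    print(hex(speculative_add(0x00FF, 0x0001)), hex((0x00FF + 0x0001) & 0xFFFF))
```

Errors of this kind depend only on the operands, whereas timing errors from overclocking also depend on the clock period; the abstract studies the combination of the two.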

Download Paper (PDF; Only available from the DATE venue WiFi)

09:15 | 5.3.4 | DVAFS: TRADING COMPUTATIONAL ACCURACY FOR ENERGY THROUGH DYNAMIC-VOLTAGE-ACCURACY-FREQUENCY-SCALING | Bert Moons, Roel Uytterhoeven, Wim Dehaene and Marian Verhelst, Katholieke Universiteit Leuven, BE
                   Bert Moons, Virginia Tech, US; Roel Uytterhoeven, Katholieke Universiteit Leuven, BE
                   Maxime Pelcat, INRIA, FR; Vincent Camus, EPFL, CH

**Abstract**
Several applications in machine learning and machine-to-human interactions tolerate small deviations in their computations. Digital systems can exploit this fault-tolerance to increase their energy-efficiency, which is crucial in embedded applications. Hence, this paper introduces a new means of Approximate Computing: Dynamic-Voltage-Accuracy-Frequency-Scaling (DVAFS), a circuit-level technique enabling a dynamic trade-off of energy versus computational accuracy that outperforms other Approximate Computing techniques. The usage and applicability of DVAFS are illustrated in the context of Deep Neural Networks, the current state-of-the-art in advanced recognition. These networks are typically executed on CPUs or GPUs due to their high computational complexity, making their deployment on battery-constrained platforms only possible through wireless connections with the cloud. This work shows how deep learning can be brought to IoT devices by running every layer of the network at its optimal computational accuracy. Finally, we demonstrate a DVAFS processor for Convolutional Neural Networks, achieving efficiencies of multiple TOPS/W.

Download Paper (PDF; Only available from the DATE venue WiFi)

09:30 | 5.3.5 | EXPLOITING COMPUTATION SKIP TO REDUCE ENERGY CONSUMPTION BY APPROXIMATE COMPUTING, AN HEVC ENCODER CASE STUDY | Alexandre Merca, Justine Bonnot, Maxime Pelcat, Wassim Hamidouche and Daniel Menard
                   
                   INSA Rennes, FR; JETI-INSR, FR

**Abstract**
The approximate computing paradigm provides methods to optimize algorithms by considering both computational accuracy and complexity. This paradigm can be exploited at different levels of abstraction, from the technological to the application level. Approximate computing at the algorithm level aims at reducing computational complexity by approximating or skipping functional blocks of the computation. Numerous applications in the signal and image processing domain integrate algorithms based on discrete optimization techniques. These techniques minimize a cost function by exploring the search space. In this paper, a new approach is proposed to exploit the computation-skipping approximate computing concept by using the SSRR technique. SSRR enables early selection of the best candidate configurations to reduce the search space. An efficient SSRR technique adjusts configuration selectivity to reduce execution complexity while selecting the functions most suitable to skip. The HEVC encoder in AI profile is used as a case study to illustrate the benefits of SSRR. In this application, two functions use discrete optimization to explore different solutions and select the one leading to the minimal cost in terms of bitrate/quality and computational energy: coding-tree partitioning and intra-mode prediction. By applying SSRR to this use case, energy reductions from 20% to 70% are explored through Pareto analysis in the Rate-Energy space.

Download Paper (PDF; Only available from the DATE venue WiFi)

**5.4 Solutions for efficient simulation and validation**

**Date:** Wednesday 29 March 2017  
**Time:** 08:30 - 10:00  
**Location / Room:** 3A

**Chair:**  
Daniel Grosse, University of Bremen, DE

**Co-Chair:**  
Alper Sen, Bogazici University, TR

The section introduces system-level frameworks for addressing memory tracing, timing estimation, real-time verification, and reliability degradation.

### 5.4.1 PERFORMANCE IMPACTS AND LIMITATIONS OF HARDWARE MEMORY-ACCESS TRACE-COLLECTION

**Speaker:**  
Graham Holland, Simon Fraser University, CA

**Authors:**  
Nicholas C. Doyle¹, Eric Matthews¹, Graham Holland¹, Alexandra Fedorova² and Lesley Shannon¹  
¹Simon Fraser University, CA; ²University of British Columbia, CA

**Abstract**  
In today's multicore architectures, complex interactions between applications in the memory system can have a significant, and highly variable, impact on application execution time. System designers typically use hardware counters to profile execution behaviours and diagnose performance problems. However, hardware counters are not always sufficient and some problems are best identified with full memory access traces. Collecting these traces in software is very expensive. Our work explores using dedicated hardware for memory-access trace collection. We focus on analyzing the limitations of hardware data collection and its impacts on application performance. The key feature of our study is that it is performed on actual hardware using two very different CPU platforms: 1) the PolyBlaze multicore soft processor and 2) the ARM Cortex-A9. In both cases, the data collection is implemented on an FPGA. Using micro-benchmarks designed to test the bounds of memory access behaviour, we illustrate the operational regions of data collection and the impact on system performance. By examining the bandwidth bottlenecks that limit the rate of data collection, as well as hardware architecture choices that can aggravate the impact on application performance, we provide guidelines that can be used to extrapolate our analysis to other systems and processor architectures.

Download Paper (PDF; Only available from the DATE venue WiFi)

### A NOVEL WAY TO EFFICIENTLY SIMULATE COMPLEX FULL SYSTEMS INCORPORATING HARDWARE ACCELERATORS

**Speaker:**  
Nikolaos Tampouratzis, Technical University of Crete, GR

**Authors:**  
Nikolaos Tampouratzis¹, Konstantinos Georgopoulos² and Ioannis Papaefstathiou³  
¹Technical University of Crete, GR; ²Telecommunication Systems Institute, Technical University of Crete, GR; ³Technical University of Crete, GR

**Abstract**  
The breakdown of Dennard scaling, coupled with persistently growing transistor counts, has considerably increased the importance of application-specific hardware acceleration; such an approach offers significant performance and energy benefits compared to general-purpose solutions. In order to thoroughly evaluate such architectures, the designer should perform an extensive design space exploration so as to evaluate the tradeoffs across the entire system. Until recently, the design has predominantly been done using Register Transfer Level (RTL) languages such as Verilog and VHDL, which, however, lead to a prohibitively long and costly design effort. In order to reduce the design time, a wide range of both commercial and academic High-Level Synthesis (HLS) tools have emerged; most of those tools handle hardware accelerators that are described in synthesizable SystemC. The problem today, however, is that most simulators used for evaluating complete user applications (i.e. full-system CPU/Mem/Peripheral simulators) lack any type of SystemC accelerator support. Within this context, this paper presents a novel simulation environment comprising a generic SystemC accelerator and probably the most widely known full-system simulator (i.e. gem5). The proposed system is the only solution supporting the very important feature of global synchronization across the integrated simulation; furthermore, it has been evaluated on two different computationally intensive use cases, and the final results demonstrate that the presented approach is orders of magnitude faster than the existing ones.

Download Paper (PDF; Only available from the DATE venue WiFi)

### AUTOMATIC CONSTRUCTION OF MODELS FOR ANALYTIC SYSTEM-LEVEL DESIGN SPACE EXPLORATION PROBLEMS

**Speaker:**  
Seyed-Hosein Attarzadeh-Niaki, Shahid Beheshti University (SBU), IR

**Authors:**  
Seyed-Hosein Attarzadeh-Niaki and Ingo Sander  
Shahid Beheshti University (SBU), IR; KTH Royal Institute of Technology, SE

**Abstract**  
Due to the variety of application models and also the target platforms used in embedded electronic system design, it is challenging to formulate a generic and extensible analytic design-space exploration (DSE) framework. Current approaches support a restricted class of application and platform models and are difficult to extend. This paper proposes a framework for automatic construction of system-level DSE problem models based on a coherent, constraint-based representation of system functionality, flexible target platforms, and binding policies. Heterogeneous semantics is captured using constraints on logical clocks. The applicability of this method is demonstrated by constructing DSE problem models from different combinations of application and platform models. Time-triggered and untimed models of the system functionality and heterogeneous target platforms are used for this purpose. Another potential advantage of this approach is that constructed models can be solved using a variety of standard and ad-hoc solvers and search heuristics.

Download Paper (PDF; Only available from the DATE venue WiFi)

10:00 End of session
Coffee Break in Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Tuesday, March 28, 2017
- Coffee Break 10:30 - 11:30
- Coffee Break 16:00 - 17:00

Wednesday, March 29, 2017
- Coffee Break 10:00 - 11:00
- Coffee Break 16:00 - 17:00

Thursday, March 30, 2017
- Coffee Break 10:00 - 11:00
- Coffee Break 15:30 - 16:00

5.5 Hot Topic Session: Spintronics-based Computing

Date: Wednesday 29 March 2017
Time: 08:30 - 10:00
Location / Room: 3C

Organisers:
Lionel Torres, LIRMM, CNRS/University of Montpellier, FR
Weisheng Zhao, Beihang University, CN

Chair:
Lionel Torres, LIRMM, CNRS/University of Montpellier, FR

Co-Chair:
Weisheng Zhao, Beihang University, CN

Numerous industrial and academic reports on emerging research devices have identified the magnetic tunnel junction (MTJ), one application of spintronics, as one of the most promising technologies for future integrated systems. MTJs provide non-volatility, fast data access and low-power operation. Indeed, MRAM, i.e. magnetic memory based on the hybrid integration of MTJs, has been commercialized since 2006 and used in a number of high-reliability applications. The aim of this session is to bring together leading experts from around the world (from the USA, France, China, Japan and Germany) on this hot topic to share the most recent results and discuss future challenges. Different computing paradigms that benefit from the interesting nature of spintronic devices will be covered in this special session. The invited speakers will talk about devices, design and compact modeling aspects, and applications, covering a full development platform from devices to circuits and systems based on spintronics.

08:30 5.5.1 MAGNETIC TUNNEL JUNCTION ENABLED ALL-SPIN STOCHASTIC SPIKING NEURAL NETWORK

Speaker:
Kaushik Roy, Purdue University, US

Authors:
Gopalakrishnan Srinivasan, Abhronil Sengupta and Kaushik Roy, Purdue University, US

Abstract
Biologically-inspired spiking neural networks (SNNs) have attracted significant research interest due to their inherent computational efficiency in performing classification and recognition tasks. The conventional CMOS-based implementations of large-scale SNNs are power intensive. This is a consequence of the fundamental mismatch between the technology used to realize the neurons and synapses, and the neuroscience mechanisms governing their operation, leading to area-expensive circuit designs. In this work, we present a three-terminal spintronic device, namely, the magnetic tunnel junction (MTJ)-heavy metal (HM) heterostructure that is inherently capable of emulating the neuronal and synaptic dynamics. We exploit the stochastic switching behavior of the MTJ in the presence of thermal noise to mimic the probabilistic spiking of cortical neurons, and the conditional change in the state of a binary synapse based on the pre- and post-synaptic spiking activity required for plasticity. We demonstrate the efficacy of a crossbar organization of our MTJ-HM based stochastic SNN in digit recognition using a comprehensive device-circuit-system simulation framework. The energy efficiency of the proposed system stems from the ultra-low switching energy of the MTJ-HM device, and the in-memory computation rendered possible by the localized arrangement of the computational units (neurons) and non-volatile synaptic memory in such crossbar architectures.
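
The probabilistic spiking described in this abstract can be illustrated with a toy model. This is a minimal sketch and not the authors' device model: it assumes a sigmoid relation between input current and switching probability, and the names `switching_probability`, `i_half` and `slope`, as well as their values, are hypothetical.

```python
import math
import random

def switching_probability(current_ua, i_half=50.0, slope=0.1):
    """Assumed sigmoid mapping from input current (uA) to switching probability.

    i_half is the (hypothetical) current at which the device switches half the time.
    """
    return 1.0 / (1.0 + math.exp(-slope * (current_ua - i_half)))

def stochastic_neuron(currents_ua, seed=0):
    """Return a spike train: 1 when the simulated device switches, else 0."""
    rng = random.Random(seed)
    return [1 if rng.random() < switching_probability(i) else 0
            for i in currents_ua]

def firing_rate(current_ua, n_steps=10_000, seed=0):
    """Fraction of time steps with a spike for a constant input current."""
    spikes = stochastic_neuron([current_ua] * n_steps, seed)
    return sum(spikes) / n_steps

if __name__ == "__main__":
    for current in (20.0, 50.0, 80.0):
        print(current, firing_rate(current))
```

In this sketch a stronger input only raises the chance of a spike per time step, so the information is carried by the firing rate rather than by a deterministic threshold crossing.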

Download Paper (PDF; Only available from the DATE venue WiFi)

OPPORTUNISTIC WRITE FOR FAST AND RELIABLE STT-MRAM

Speaker:
Mehdi Tahoori, Karlsruhe Institute of Technology, DE

Abstract
Due to the stochastic switching behavior of the bitcell in Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM), an excessive write margin is required to guarantee an acceptable level of reliability and yield. This prevents the usage of STT-MRAM in fast memories such as L1 or L2 caches. The excessive write margin of STT-MRAM can be reduced to a large extent by an opportunistic write (i.e., terminating the write process before all bit switchings are completed) and by reducing the thermal stability factor. The bits with unfinished writes have to be processed by robust Error Correction Codes (ECCs). However, such coding schemes have relatively large decoding latencies, which increase the overall read latency significantly. Moreover, thermally induced retention failures can limit the applicability of such schemes. In this paper, we exploit the fact that error detection is much faster than correction. Therefore, errors can be detected quickly and all erroneous data can be reverted before they arrive at critical parts of the system (e.g., commit stage or memory ports). We also provide an adaptive approach to manage temperature-dependent retention failures at runtime. Hence, our proposed approach enables the use of STT-MRAM technology for fast cache applications.

Download Paper (PDF; Only available from the DATE venue WiFi)

This session deals with 3D reliability and repair, integration of compression into standard test infrastructure, and reusing silicon debug infrastructure to enhance functional performance.

**5.6.1** Fault Clustering Technique for 3D Memory BISR

**Speaker:** Tianjian Li, Shanghai Jiao Tong University, CN

**Authors:**
- Tianjian Li
- Yan Han
- Xiaoyao Liang
- Hsien-Hsin S. Lee
- Li Jiang

**Abstract**
Three-dimensional (3D) memory has gained great momentum because of its large storage capacity, bandwidth, and other advantages. A critical challenge for 3D memory is the significant yield loss due to the disruptive integration process: any memory die that cannot be successfully repaired leads to the failure of the whole stack. The repair ratio of each die must be as high as possible to guarantee the overall yield. Existing memory repair methods, however, follow the traditional way of using redundancies: a redundant row/column replaces a row/column containing few or even one faulty cell. We propose a novel technique specifically for 3D memory that can overcome this limitation. It can cluster faulty cells across layers to the same row/column in the same memory array so that each redundant row/column can repair more “faults”. Moreover, it can be applied to the existing repair algorithms. We design the BIST and BISR modules to implement the proposed repair technique. Experimental results show more than 71% enhancement of the repair ratio over the global 3D GESP solution and 80% redundancy-cost reduction.

Download Paper (PDF; Only available from the DATE venue WiFi)

**5.6.2** Architectural Evaluations on TSV Redundancy for Reliability Enhancement

**Speaker:** Yen-Hao Chen, National Tsing Hua University, Taiwan, TW

**Authors:**
- Yen-Hao Chen
- Chien-Pang Chu
- Russell Barnes
- TingTing Hwang

**Abstract**
Three-dimensional Integrated Circuits (3D-ICs) are a next-generation technology that could be a solution to overcome the scaling problem. They stack dies with Through-Silicon Vias (TSVs) so that signals can be transmitted through dies vertically. However, researchers have noticed that the aging effect due to electromigration (EM) may result in faulty TSVs and affect the chip lifetime [1]. Several redundant TSV architectures have been proposed to address this issue. By replacing the faulty TSV with redundant TSVs which are added at design time, chips can achieve better reliability and longer lifetime. In this paper, we study the tradeoff of various redundant TSV architectures in terms of effectiveness and cost. To allow a more realistic measurement of reliability, we propose a new standard, repair rate, to appraise the redundant TSV architectures. Moreover, to design a more flexible and efficient structure, we enhance the ring-based design [2] so that it can adjust the size of the TSV block and the TSV redundancy.

Download Paper (PDF; Only available from the DATE venue WiFi)

**5.6.3** Reusing Trace Buffers to Enhance Cache Performance

**Speaker:** Neetu Jindal, Indian Institute of Technology Delhi, IN

**Authors:**
- Neetu Jindal
- Preeti Ranjan Panda
- Smruti R. Sarangi, Indian Institute of Technology Delhi, IN

**Abstract**
With the increasing complexity of modern Systems-on-Chip, the possibility of functional errors escaping design verification is growing. Post-silicon validation targets the discovery of these errors in early hardware prototypes. Due to limited visibility and observability, dedicated design-for-debug (DFD) hardware such as trace buffers is inserted to aid post-silicon validation. In spite of its benefit, such hardware incurs area overheads, which impose size limitations. However, the overhead could be overcome if the area dedicated to DFD could be reused in-field. In this work, we present a novel method for reusing an existing trace buffer as a victim cache of a processor to enhance performance. The trace buffer storage space is reused for the victim cache, with a small amount of additional controller logic. Experimental results on several benchmarks and trace buffer sizes show that the proposed approach can enhance the average performance by up to 8.3% over a baseline architecture. We also propose a strategy for dynamic power management of the structure, to enable saving energy with negligible impact on performance.
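
The victim-cache idea mentioned in this abstract can be illustrated with a small functional model. This is a minimal sketch under simplifying assumptions, not the authors' design: it models a direct-mapped cache backed by a small fully associative buffer with LRU replacement, tracks block addresses only, and ignores timing, the trace-buffer controller and power management. The class name `DirectMappedWithVictim` and its parameters are hypothetical.

```python
from collections import OrderedDict

class DirectMappedWithVictim:
    """Direct-mapped cache with a small fully associative victim buffer."""

    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.lines = [None] * num_sets      # one block address per set
        self.victim = OrderedDict()         # evicted blocks, oldest first
        self.victim_entries = victim_entries
        self.hits = self.victim_hits = self.misses = 0

    def access(self, block_addr):
        idx = block_addr % self.num_sets
        if self.lines[idx] == block_addr:
            self.hits += 1
            return "hit"
        evicted = self.lines[idx]
        if block_addr in self.victim:
            # Block was evicted recently: take it back from the victim buffer.
            del self.victim[block_addr]
            self.victim_hits += 1
            result = "victim_hit"
        else:
            self.misses += 1
            result = "miss"
        self.lines[idx] = block_addr
        if evicted is not None and self.victim_entries > 0:
            # The displaced block moves into the victim buffer (LRU eviction).
            self.victim[evicted] = True
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)
        return result

if __name__ == "__main__":
    cache = DirectMappedWithVictim(num_sets=4, victim_entries=2)
    print([cache.access(a) for a in (0, 4, 0, 4)])
```

Blocks 0 and 4 map to the same set, so without the buffer every access in the usage example would miss; with it, the last two accesses are served from the victim buffer.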

Download Paper (PDF; Only available from the DATE venue WiFi)

**5.6.4** Optimization of Retargeting for IEEE 1149.1 Tap Controllers with Embedded Compression

**Speaker:** Sebastian Huhn, University of Bremen, DE

**Authors:**
- Sebastian Huhn
- Stephan Eggersglüß
- Krishnendu Chakrabarty
- Rolf Drechsler

**Abstract**
We present a formal optimization technique that enables retargeting for codeword-based IEEE 1149.1-compliant TAP controllers. The proposed method addresses the problem of high test data volume and Test Application Time (TAT) for a system-on-chip design during board or in-field testing, as well as during debugging. This procedure determines an optimal set of codewords with respect to given hardware constraints, e.g., embedded dictionary size and the interface to the Test Data Register in the IEEE 1149.1 Std. A complete traversal of the spanned search space is possible through the use of formal methods. An optimal set of codewords can be determined, which is directly utilized for retargeting. The proposed method is evaluated using high-entropy test data, which is known to be the least amenable to compression, as well as input data for debugging and Functional Verification (FV) test data. Our results show a compression ratio improvement of more than 30% and a reduction in TAT of up to 20% compared to previous techniques.

Download Paper (PDF; Only available from the DATE venue WiFi)

**Novel Magnetic Burn-In for Retention Testing of STTRAM**

**Speaker:**
Swaroop Ghosh, Pennsylvania State University, US

**Authors:**
Mohammad Nasim Imtiaz Khan, Anirudh Iyengar and Swaroop Ghosh, Pennsylvania State University, US

**Abstract**
Spin-Transfer Torque RAM (STTRAM) is an emerging Non-Volatile Memory (NVM) technology that has drawn significant attention due to the complete elimination of bitcell leakage. However, it brings new challenges in characterizing the retention time of the array during test. Significant shifts in retention time under static (process variation, PV) and dynamic (voltage and temperature fluctuation) variability exacerbate this issue. In this paper, we propose a novel magnetic burn-in (MBI) test which can be implemented with minimal changes to the existing test flow to enable STTRAM retention testing in a short test time. The magnetic burn-in is also combined with thermal burn-in (MBI+BI) for further compression of retention and test time. Simulation results indicate that MBI with 220 Oe (at 25°C) can improve the test time by $3.71 \times 10^{13}$X, while MBI+BI with 220 Oe at 125°C can improve the test time by $1.97 \times 10^{14}$X.
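
Why an external field and elevated temperature shorten retention can be illustrated with a textbook-style estimate. This is a rough sketch, not the authors' simulation setup: it assumes the Néel-Arrhenius relation (retention time = attempt time × exp(Δ)), a macrospin approximation in which a field H opposing the stored state scales the thermal stability factor by (1 − H/H_k)², and an energy barrier that does not itself vary with temperature. The values of `TAU0_S`, `hk_oe` and `delta0` are hypothetical, so the printed factors are not expected to match the figures in the abstract.

```python
import math

TAU0_S = 1e-9  # attempt time, assumed to be about 1 ns

def thermal_stability(delta0, h_oe=0.0, hk_oe=1000.0,
                      temp_k=298.15, ref_temp_k=298.15):
    """Thermal stability factor under an opposing field and at temperature temp_k.

    delta0 is the factor at zero field and ref_temp_k; hk_oe is a
    hypothetical anisotropy field (h_oe must stay below it).
    """
    field_factor = (1.0 - h_oe / hk_oe) ** 2   # barrier lowering by the field
    temp_factor = ref_temp_k / temp_k          # Delta = E_b / (k_B * T)
    return delta0 * field_factor * temp_factor

def retention_time_s(delta):
    """Neel-Arrhenius estimate of the mean retention time in seconds."""
    return TAU0_S * math.exp(delta)

def speedup(delta0, **stress):
    """Ratio of nominal retention time to retention time under stress."""
    nominal = retention_time_s(thermal_stability(delta0))
    stressed = retention_time_s(thermal_stability(delta0, **stress))
    return nominal / stressed

if __name__ == "__main__":
    print(speedup(60.0, h_oe=220.0))                  # field only
    print(speedup(60.0, h_oe=220.0, temp_k=398.15))   # field plus heat
```

Because retention depends exponentially on the stability factor, even a moderate reduction of the barrier compresses the required test time by many orders of magnitude.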

## 5.7 Schedulability Analysis

**Date:** Wednesday 29 March 2017  
**Time:** 08:30 - 10:00  
**Location / Room:** 3B  

**Chair:** Petru Eles, Linköpings universitet, SE  
**Co-Chair:** Andreas Naderlinger, University of Salzburg, AT

The papers in this session introduce new schedulability analyses for real-time systems, including systems with precedence constraints, real-time networks-on-chip, and mixed-critical systems.

### 5.7.1 BOUNDING DEADLINE MISSES IN WEAKLY-HARD REAL-TIME SYSTEMS WITH TASK DEPENDENCIES

**Speaker:** Zain A. H. Hammadeh, TU Braunschweig, DE  
**Authors:** Zain A. H. Hammadeh$^1$, Sophie Quinton$^2$, Rolf Ernst$^1$, Rafik Henia$^3$ and Laurent Rioux$^3$  

$^1$TU Braunschweig, DE; $^2$Inria, FR; $^3$Thales Research & Technology, FR

**Abstract**
Real-time systems with functional dependencies between tasks often require end-to-end (as opposed to task-level) guarantees. For many of these systems, it is even possible to accept the possibility of longer end-to-end delays if one can bound their frequency. Such systems are called weakly-hard. In this paper we provide end-to-end deadline miss models for systems with task chains using Typical Worst-Case Analysis (TWCA). This bounds the number of potential deadline misses in a given sequence of activations of a task chain. To achieve this we exploit task chain properties which arise from the priority assignment of tasks in static-priority preemptive systems. This work is motivated by and validated on a realistic case study inspired by industrial practice and synthetic test cases.
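
A weakly-hard requirement is commonly written as an (m, k) constraint: at most m deadline misses in any k consecutive activations. The following minimal sketch checks such a constraint on a recorded trace of misses; it is not the TWCA analysis of the paper, which bounds misses analytically rather than from a trace. The function names are hypothetical.

```python
def max_misses_in_window(missed, k):
    """Largest number of deadline misses in any k consecutive activations.

    missed is a sequence of 0/1 (or bool) flags, one per activation.
    """
    if k <= 0:
        raise ValueError("window length k must be positive")
    window = sum(missed[:k])          # misses in the first window
    worst = window
    for i in range(k, len(missed)):
        # Slide the window by one activation.
        window += missed[i] - missed[i - k]
        worst = max(worst, window)
    return worst

def satisfies_weakly_hard(missed, m, k):
    """True if the trace has at most m misses in every window of k activations."""
    return max_misses_in_window(missed, k) <= m

if __name__ == "__main__":
    trace = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]
    print(max_misses_in_window(trace, 3))
    print(satisfies_weakly_hard(trace, m=2, k=3))
```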

### 5.7.2 REAL-TIME COMMUNICATION ANALYSIS FOR NETWORKS-ON-CHIP WITH BACKPRESSURE

**Speaker:** Sebastian Tobuschat, TU Braunschweig, DE  
**Authors:** Sebastian Tobuschat and Rolf Ernst, TU Braunschweig, DE

**Abstract**
Networks-on-Chip (NoCs) for safety-critical domains require formal guarantees for the worst-case behavior of all real-time senders. The majority of existing analysis approaches is capable of providing such guarantees only under the assumption that the queues in the routers never overflow, i.e., that no backpressure occurs. This leads to overly pessimistic guarantees or unfulfilled design requirements in many setups using commercially available NoCs where buffer space is limited. Therefore, we propose an alternative analysis methodology providing formal timing guarantees for packet latencies also in a NoC where backpressure occurs. The analysis allows exploiting the behavior of individual traffic streams to determine safe upper bounds on the latency of individual packets. The correctness of the analysis is evaluated experimentally through comparison with simulation results.

09:30 5.7.3 PROBABILISTIC SCHEDULABILITY ANALYSIS FOR FIXED PRIORITY MIXED CRITICALITY REAL-TIME SYSTEMS

Speaker:
Yasmina Abdeddaïm, Université Paris-Est, LIGM, ESIEE Paris, FR

Authors:
Yasmina Abdeddaïm¹ and Dorin Maxim²
¹Université Paris-Est, LIGM, ESIEE-Paris, FR; ²University of Lorraine - Loria - Inria Nancy Grand Est, FR

Abstract
In this paper we present a probabilistic response time analysis for mixed criticality real-time systems running on a single processor according to a fixed priority pre-emptive scheduling policy. The analysis extends the existing state of the art probabilistic analysis to the case of mixed criticalities, taking into account both the level of assurance at which each task needs to be certified, as well as the possible criticalities at which the system may execute. The proposed analysis is formally presented as well as explained with the aid of an illustrative example.
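
A basic building block of probabilistic response time analysis is the convolution of discrete execution time distributions. The sketch below is a simplified illustration and not the analysis presented in the paper: it assumes independent execution times, sums the demand of one job of each task only, and ignores later preemptions and criticality levels. The function names and the example distributions are hypothetical.

```python
from collections import defaultdict

def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent discrete random variables.

    Each distribution maps a value (e.g. execution time) to its probability.
    """
    out = defaultdict(float)
    for va, pa in dist_a.items():
        for vb, pb in dist_b.items():
            out[va + vb] += pa * pb
    return dict(out)

def exceedance_probability(dist, deadline):
    """Probability that the value is strictly larger than the deadline."""
    return sum(p for v, p in dist.items() if v > deadline)

if __name__ == "__main__":
    high_prio = {2: 0.9, 5: 0.1}   # hypothetical execution time distributions
    low_prio = {3: 0.8, 6: 0.2}
    response = convolve(high_prio, low_prio)
    print(response)
    print(exceedance_probability(response, deadline=8))
```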

Download Paper (PDF; Only available from the DATE venue WiFi)

10:00 End of session
Coffee Break in Exhibition Area


IP2 Interactive Presentations

Date: Wednesday 29 March 2017
Time: 10:00 - 10:30
Location / Room: IP sessions (in front of rooms 4A and 5A)

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated with each IP paper is on display throughout the morning. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon Interactive Presentations session, the ‘Best IP of the Day’ award is given.

IP2-1: COMPACT MODELING AND CIRCUIT-LEVEL SIMULATION OF SILICON NANOPHOTONIC INTERCONNECTS

Speaker: Yuyang Wang, UC Santa Barbara, US
Authors: Rui Wu, Yuyang Wang, Zeyu Zhang, Chong Zhang, Clint Schow, John Bowers and Kwang-Ting Cheng, UC Santa Barbara, US

Abstract: Nanophotonic interconnects have been playing an increasingly important role in the datacom regime. Greater integration of silicon photonics demands modeling and simulation support for design validation, optimization and design space exploration. In this work, we develop compact models for a number of key photonic devices, which are extensively validated by the measurement data of a fabricated optical network-on-chip (ONoC). Implemented in SPICE-compatible Verilog-A, the models are used in circuit-level simulations of full optical links. The simulation results match well with the measurement data. Our model library and simulation approach enable the electro-optical (EO) co-simulation, allowing designers to include photonic devices in the whole system design space, and to co-optimize the transmitter, interconnect, and receiver jointly.

Download Paper (PDF; Only available from the DATE venue WiFi)

IP2-2: A TRUE RANDOM NUMBER GENERATOR BASED ON PARALLEL STT-MTJS

Speaker: Yuanzhuo Qu, University of Alberta, CA
Authors: Yuanzhuo Qu¹, Jie Han¹, Bruce Cockburn¹, Yue Zhang², Weisheng Zhao² and Witold Pedrycz³
¹University of Alberta, CA; ²Beihang University, CN

Abstract: Random number generators are an essential part of cryptographic systems. For the highest level of security, true random number generators (TRNG) are needed instead of pseudo-random number generators. In this paper, the stochastic behavior of the spin transfer torque magnetic tunnel junction (STT-MTJ) is utilized to produce a TRNG design. A parallel structure with multiple MTJs is proposed that minimizes device variation effects. The design is validated in a 28-nm CMOS process with Monte Carlo simulation using a compact model of the MTJ. The National Institute of Standards and Technology (NIST) statistical test suite is used to verify the randomness quality when generating encryption keys for the Transport Layer Security or Secure Sockets Layer (TLS/SSL) cryptographic protocol. This design has a generation speed of 177.8 Mbit/s, and an energy of 0.64 pJ is consumed to set up the state in one MTJ.
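
To give a flavour of the NIST statistical test suite (SP 800-22), the sketch below implements its simplest member, the frequency (monobit) test, which checks that ones and zeros are roughly balanced. It illustrates a single test only and makes no claim about how the authors ran the suite; the function name `monobit_p_value` is hypothetical. A sequence is conventionally considered to pass when the p-value is at least 0.01.

```python
import math

def monobit_p_value(bits):
    """p-value of the NIST SP 800-22 frequency (monobit) test.

    bits is a sequence of 0/1 values. Ones count as +1 and zeros as -1;
    for a random sequence the normalized sum is approximately half-normal.
    """
    n = len(bits)
    if n == 0:
        raise ValueError("bit sequence must not be empty")
    s = sum(1 if b else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))

if __name__ == "__main__":
    sample = [int(c) for c in "1011010101"]
    print(monobit_p_value(sample))
```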

Download Paper (PDF; Only available from the DATE venue WiFi)

IP2-3: ENABLING AREA EFFICIENT RF ICS THROUGH MONOLITHIC 3D INTEGRATION

Speaker: Panagiotis Chaourani, KTH, Royal Institute of Technology, Stockholm, SE
Authors: Panagiotis Chaourani, Per-Erik Hellström, Saul Rodriguez, Raul Onet and Ana Rusu, KTH, Royal Institute of Technology, SE

Abstract: The Monolithic 3D (M3D) integration technology has emerged as a promising alternative to dimensional scaling thanks to the unprecedented integration density capabilities and the low interconnect parasitics that it offers. In order to support technological investigations and enable future M3D circuits, M3D design methodologies, flows and tools are essential. Prospective M3D digital applications have attracted a lot of scientific interest. This paper identifies the potential of M3D RF/analog circuits and presents the first attempt to demonstrate such circuits. To this end, an M3D custom design platform, which is fully compatible with commercial design tools, is proposed and validated. The design platform includes process characteristics, device models, LVS and DRC rules and a parasitic extraction flow. The envisioned M3D structure is built on a commercial CMOS process that serves as the bottom tier, whereas an SOI process is used as the top tier. To validate the proposed design flow and to investigate the potential of M3D RF/analog circuits, an RF front-end design for ZigBee WPAN applications is used as a case study. The M3D RF front-end circuit achieves a 35.5% area reduction while showing performance similar to that of the original 2D circuit.

Download Paper (PDF; Only available from the DATE venue WiFi)

IP2-4: RECONFIGURABLE THRESHOLD LOGIC GATES USING OPTOELECTRONIC CAPACITORS

Speaker: Baris Taskin, Drexel University, US
Authors: Ragh Kuttappa, Lunal Khuon, Bahram Nabet and Baris Taskin, Drexel University, US

Abstract: This paper investigates the integration of optoelectronic devices with CMOS threshold logic gates to design reconfigurable Boolean functions. The weight of the optoelectronic device can be altered by changing the optical power, which is used to reconfigure the threshold logic (TL) gate. The proposed optoelectronic capacitor based TL (OECTL) gates are designed for i) simple AND/NAND and OR/NOR gates with large fan-in and ii) linearly separable Boolean functions that can be reconfigured to other linearly separable Boolean functions, constrained in reconfiguration by the specifics of TL operation. SPICE simulations in 65nm bulk CMOS technology with a Verilog-A model for the optoelectronic capacitor demonstrate that i) AND/NAND and OR/NOR gates are 2X faster as fan-in increases and consume low power, and ii) Boolean functions can be reconfigured with 0.58X smaller delay and 0.46X lower power than standard CMOS.
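
The behaviour of a threshold logic gate can be illustrated in a few lines. This is a purely functional sketch: the gate outputs 1 when the weighted sum of its inputs reaches the threshold, and changing the weights or the threshold reconfigures the Boolean function. It does not model the optoelectronic capacitor or any circuit-level effect, and the function names and example weights are hypothetical.

```python
from itertools import product

def threshold_gate(inputs, weights, threshold):
    """Output 1 if the weighted sum of the binary inputs reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def truth_table(weights, threshold):
    """Map every input combination to the gate output."""
    n = len(weights)
    return {bits: threshold_gate(bits, weights, threshold)
            for bits in product((0, 1), repeat=n)}

if __name__ == "__main__":
    # Same 3-input gate, reconfigured only through weights and threshold.
    print(truth_table((1, 1, 1), 3))   # AND
    print(truth_table((1, 1, 1), 1))   # OR
    print(truth_table((2, 1, 1), 2))   # a OR (b AND c)
```

Only linearly separable functions can be realized this way; XOR, for example, has no valid choice of weights and threshold for a single gate.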

Download Paper (PDF; Only available from the DATE venue WiFi)

IP2-5: I-BEP: A NON-REDUNDANT AND HIGH-CONCURRENCY MEMORY PERSISTENCY MODEL

Speaker: Yuanchao Xu, Capital Normal University, CN
Authors: Yuanchao Xu, Zeyi Hou, Junfeng Yan, Lu Yang and Hu Wan, Capital Normal University, CN

Abstract: Byte-addressable, non-volatile memory (NVM) technologies enable fast persistent updates but incur potential data inconsistency upon a failure. Recent proposals present several persistency models to guarantee data consistency. However, they fail to express the minimal persist ordering as a result of inducing unnecessary ordering constraints. In this paper, we propose i-BEP, a non-redundant high concurrency memory persistency model, which expresses epoch dependency via persist directed acyclic graph instead of program order. Additionally, we propose two techniques, background persist and deferred eviction, to enhance the performance of i-BEP. We demonstrate that i-BEP can improve the performance by 15% for typical data structures on average over buffered epoch persistency (BEP) model.

Download Paper (PDF; Only available from the DATE venue WiFi)

IP2-6: SPMS: STRAND BASED PERSISTENT MEMORY SYSTEM

Speaker: Shuo Li, National University of Defense Technology, CN
Authors: Shuo Li¹˒², Peng Wang³˒⁴, Dong Xiao¹, Guanyu Sun² and Fang Liu¹
¹National University of Defense Technology, CN; ²National University of Defense Technology, CN; ³Peking University, CN; ⁴TU Braunschweig, DE

Abstract: Emerging non-volatile memories enable persistent memory, which offers the opportunity to directly access persistent data structures residing in main memory. In order to keep persistent data consistent in case of system failures, most prior work relies on persist ordering constraints, which incur significant overheads. Strand persistency minimizes persist ordering constraints. However, there is still no proposed persistent memory design based on strand persistency due to its implementation complexity. In this work, we propose a novel persistent memory system based on strand persistency, called SPMS. SPMS consists of cacheline-based strand group tracking components, a volatile strand buffer and ultra-capacitors incorporated in persistent memory modules. SPMS can track each strand and guarantee its atomicity. In case of system failures, committed strands buffered in the strand buffer can be flushed back to persistent memory within the residual energy window provided by the ultra-capacitors. Our evaluations show that SPMS outperforms the state-of-the-art persistent memory system by 6.6% and has slightly better performance than the baseline without any consistency guarantee. Moreover, SPMS reduces the persistent memory write traffic by 30%/6, with the help of the strand buffer.

Download Paper (PDF; Only available from the DATE venue WiFi)

IP2-7: ARCHITECTING HIGH-SPEED COMMAND SCHEDULERS FOR OPEN-ROW REAL-TIME SDRAM CONTROLLERS

Speaker: Leonardo Ecco, TU Braunschweig, DE
Authors: Leonardo Ecco¹ and Rolf Ernst²
¹Institute of Computer and Network Engineering, TU Braunschweig, DE; ²TU Braunschweig, DE

Abstract: As SDRAM modules get faster and their data buses wider, researchers have proposed the use of the open-row policy in command schedulers for real-time SDRAM controllers. While the real-time properties of such schedulers have been thoroughly investigated, their hardware implementation has not. Hence, in this paper, we propose a highly parallel, multi-stage architecture that implements a state-of-the-art open-row real-time command scheduler. Moreover, we evaluate this architecture from the hardware overhead and performance perspectives.

Download Paper (PDF; Only available from the DATE venue WiFi)

IP2-8: AUTOMATIC EQUIVALENCE CHECKING FOR SYSTEMC-TLM 2.0 MODELS AGAINST THEIR FORMAL SPECIFICATIONS

Speaker: Mehran Golli, University of Bremen, DE
Authors: Mehran Golli, Jannis Stoppe and Rolf Drechsler, University of Bremen, DE

Abstract: The necessity to handle the increasing complexity of digital circuits has led to the usage of more and more abstract design paradigms. In particular, the Electronic System Level (ESL) has become an area of active research and industrial application, especially via SystemC and its Transaction Level Modeling (TLM) framework. Additionally, the usage of formal specification languages such as the Unified Modeling Language (UML) prior to the implementation (even at higher abstraction levels) is now a broadly accepted workflow. Utilizing this layered approach leaves the translation from the specification to the implementation to the designer, leaving the question unanswered how the equivalence of these should be verified. This paper proposes a novel, non-intrusive and broadly applicable approach to automatically validate the equivalence of the structural and behavioral information of a SystemC-TLM 2.0 model and its formal specification.

Download Paper (PDF; Only available from the DATE venue WiFi)
**IP2-9 (Best Paper Award Candidate)**

**HEAD-MOUNTED SENSORS AND WEARABLE COMPUTING FOR AUTOMATIC TUNNEL VISION ASSESSMENT**

**Speaker:** Josue Ortiz, Complutense University of Madrid, ES

**Authors:** Yuchao Ma and Hassan Ghasezamadeh, Washington State University, US

**Abstract**

As the second leading cause of blindness worldwide, glaucoma impacts a large population of individuals over 40. Although visual acuity often remains unaffected in early stages of the disease, visual field loss, expressed by tunnel vision condition, gradually increases. Glaucoma often remains undetected until it has moved into advanced stages. In this paper, we introduce a wearable system for automatic tunnel vision detection using head-mounted sensors and machine learning techniques. We develop several tasks, including reading and observation, and estimate visual field loss by analyzing user’s head movements while performing the tasks. An integrated computational module takes sensor signals as input, passes the data through several automatic data processing phases, and returns a final result by merging task-level predictions. For validation purposes, a series of experiments is conducted with 10 participants using tunnel vision simulators. Our results demonstrate that the proposed system can detect mild and moderate tunnel visions with an accuracy of 93.3% using a leave-one-subject-out analysis.

Download Paper (PDF; Only available from the DATE venue WiFi)

**IP2-10 (Best Paper Award Candidate)**

**RETRODMR: TROUBLESHOOTING NON-DETERMINISTIC FAULTS WITH RETROSPECTIVE DMR**

**Speaker:** Paolo Bernardi, Politecnico di Torino, IT

**Authors:** Paolo Bernardi, Davide Appello, Giampaolo Giacopelli, Alessandro Motta, Alberto Pagani, Giorgio Pollaccia, Christian Rabb, Marco Restifo, Prir Ruberg, Ernesti Sanchez, Claudia Mario Villa, and Federico Venini

**Abstract**

The thermal activity during testing can be considerably reduced by applying power-oriented filling of the unspecified bits of test vectors. However, traditional power-oriented X-fill methods do not correlate the thermal activity with delay failures, and they consume all the unspecified bits to reduce the power dissipation at every region of the core. Therefore, they adversely affect the un-modeled defect coverage of the generated test vectors. The proposed method identifies the unspecified bits that are more critical for delay failures, and it fills them in such a way as to create a thermal safe neighborhood around the most critical regions of the core. For the rest of the unspecified bits a probabilistic model based on output deviations is adopted to increase the un-modeled defect coverage of the test vectors. Experimental results show that the thermal activity and the inter-connection delays of critical regions of the core are comparable to those of the power-oriented X-fill methods, while the un-modeled defect coverage is as high as that of the random-fill method.

Download Paper (PDF; Only available from the DATE venue WiFi)

**IP2-11**

**A COMPREHENSIVE METHODOLOGY FOR STRESS PROCEDURES EVALUATION AND COMPARISON FOR BURN-IN OF AUTOMOTIVE SOC**

**Speaker:** Fotios Vartziotis, Computer Engineering, T.E.I. of Epirus, Greece, GR

**Authors:** Fotios Vartziotis1, and Chrysovalantis Kavousianos2

**Abstract**

Environmental and electrical stress phases are commonly applied to automotive devices during manufacturing test. The combination of thermal and electrical stress is used to give rise to early life latent failures that can be naturally found in a population of devices by accelerating aging processes through Burn-In test phases. This paper provides a methodology to evaluate and compare the stress procedure to be run during Burn-In; the proposed method takes into account several factors such as circuit activity, chip surface temperature and current consumption required by the stress procedure, and also considers Burn-In flow and tester limitations. A specific metric called Stress Coverage is suggested summing up all the stress contributions. Experimental results are gathered on an automotive device, showing the comparison between scan-based and functional stress run by a massively parallelized test equipment; reported figures and tables quantify the differences between the two approaches in terms of stress.

Download Paper (PDF; Only available from the DATE venue WiFi)

**IP2-12**

**ENERGY EFFICIENT STOCHASTIC COMPUTING WITH SOBOL SEQUENCES**

**Speaker:** Siting Liu, University of Alberta, CA

**Authors:** Siting Liu and Jie Han, University of Alberta, CA

**Abstract**

Energy efficiency presents a significant challenge for stochastic computing (SC) due to the long random binary bit streams required for accurate computation. In this paper, a type of low discrepancy (LD) sequences, the Sobol sequence, is considered for energy-efficient implementations of SC circuits. The use of Sobol sequences improves the output accuracy of a stochastic circuit with a reduced sequence length compared to the use of another type of LD sequences, the Halton sequence, and conventional LFSR-generated pseudorandom sequences. The use of Sobol sequences leads to a similar or higher accuracy than using Halton sequences for basic arithmetic operations. Sobol sequence generators cost less energy than the Halton counterparts when multiple random sequences are required in a circuit, thus the use of Sobol sequences can lead to a higher energy efficiency in an SC circuit than using Halton sequences.
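
As a rough illustration of the idea in this abstract, the sketch below multiplies two values in the stochastic domain by ANDing two comparator-generated bit streams, once driven by a two-dimensional Sobol sequence and once by pseudorandom numbers. It is not the authors' circuit: the function names, the 16-bit resolution, the stream length of 256, and the use of Python's `random` module as a stand-in for an LFSR are all assumptions made for this example.

```python
import random


def sobol_2d(n_points, bits=16):
    """Return the first n_points of a 2-D Sobol sequence as integer pairs in [0, 2**bits)."""
    # Direction integers m_k. Dimension 1 uses m_k = 1 (van der Corput sequence);
    # dimension 2 uses m_k = 2*m_{k-1} XOR m_{k-1} (primitive polynomial x + 1).
    m1 = [1] * bits
    m2 = [1]
    for _ in range(bits - 1):
        m2.append((2 * m2[-1]) ^ m2[-1])
    # Scale v_k = m_k / 2**k to integers with `bits` fractional bits.
    v1 = [m << (bits - k - 1) for k, m in enumerate(m1)]
    v2 = [m << (bits - k - 1) for k, m in enumerate(m2)]

    x = y = 0
    points = []
    for i in range(n_points):
        points.append((x, y))
        # Gray-code update: XOR with the direction number indexed by
        # the position of the lowest zero bit of i.
        c = 0
        while (i >> c) & 1:
            c += 1
        if c < bits:
            x ^= v1[c]
            y ^= v2[c]
    return points


def pseudorandom_2d(n_points, bits=16, seed=1):
    """Pseudorandom stand-in for two independent LFSR outputs (an assumption)."""
    rng = random.Random(seed)
    return [(rng.randrange(1 << bits), rng.randrange(1 << bits)) for _ in range(n_points)]


def sc_multiply(a, b, points, bits=16):
    """Stochastic multiply: a stream bit is 1 when the random number is below the value.

    ANDing the two streams and counting ones estimates a * b.
    """
    scale = 1 << bits
    ones = sum(1 for x, y in points if x < a * scale and y < b * scale)
    return ones / len(points)


if __name__ == "__main__":
    a, b, length = 0.3, 0.7, 256
    for name, gen in (("Sobol", sobol_2d), ("pseudorandom", pseudorandom_2d)):
        estimate = sc_multiply(a, b, gen(length))
        print(f"{name:>12}: estimate={estimate:.4f} error={abs(estimate - a * b):.4f}")
```

Running the script prints the estimate and absolute error for both generators at the same stream length, which is one simple way to compare accuracy against sequence length.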

Download Paper (PDF; Only available from the DATE venue WiFi)
LOGIC ANALYSIS AND VERIFICATION OF N-INPUT GENETIC LOGIC CIRCUITS

Speaker: Hasan Baig, Technical University of Denmark, DK
Authors: Hasan Baig and Jan Madsen, Technical University of Denmark, DK

Abstract: Nature is using genetic logic circuits to regulate the fundamental processes of life. These genetic logic circuits are triggered by a combination of external signals, such as chemicals, proteins, light and temperature, to emit signals to control other gene expressions or metabolic pathways accordingly. As compared to electronic circuits, genetic circuits exhibit stochastic behavior and do not always behave as intended. Therefore, there is a growing interest in being able to analyze and verify the logical behavior of a genetic circuit model, prior to its physical implementation in a laboratory. In this paper, we present an approach to analyze and verify the Boolean logic of a genetic circuit from the data obtained through stochastic analog circuit simulations. The usefulness of this analysis is demonstrated through different case studies illustrating how our approach can be used to verify the expected behavior of an n-input genetic logic circuit.
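
A minimal sketch of how a Boolean function might be read off stochastic simulation data is given below. It is not the authors' algorithm: the threshold rule, the `min_fraction` parameter, the function names, and the hand-written example traces are hypothetical and only illustrate the general idea of mapping noisy analog output levels to logic values per input combination.

```python
def extract_truth_table(traces, threshold, min_fraction=0.5):
    """Map each input combination to 0 or 1.

    traces: dict from input tuple, e.g. (0, 1), to a list of simulated
            output concentration samples.
    The output counts as logic 1 when at least `min_fraction` of the
    samples are at or above `threshold`.
    """
    table = {}
    for inputs, samples in traces.items():
        high = sum(1 for s in samples if s >= threshold)
        table[inputs] = int(high / len(samples) >= min_fraction)
    return table


def implements(table, expected):
    """Check the extracted table against an expected Boolean function."""
    return all(out == int(bool(expected(*inputs))) for inputs, out in table.items())


# Hypothetical output concentrations (arbitrary units) for a 2-input circuit.
EXAMPLE_TRACES = {
    (0, 0): [2, 5, 1, 4, 3],
    (0, 1): [6, 9, 14, 7, 8],
    (1, 0): [10, 4, 16, 9, 6],
    (1, 1): [48, 61, 12, 55, 70],
}

if __name__ == "__main__":
    table = extract_truth_table(EXAMPLE_TRACES, threshold=15)
    print(table)
    print("behaves as AND:", implements(table, lambda a, b: a and b))
```

Occasional glitches, such as the single high sample for input (1, 0), are tolerated by the fraction rule rather than by the threshold alone.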

Download Paper (PDF; Only available from the DATE venue WiFi)

A NOVEL WAY TO EFFICIENTLY SIMULATE COMPLEX FULL SYSTEMS INCORPORATING HARDWARE ACCELERATORS

Speaker: Nikolaos Tampouratzis, Technical University of Crete, GR
Authors: Nikolaos Tampouratzis1, Konstantinos Georgopoulos2 and Ioannis Papaefthathio1
1Technical University of Crete, GR; 2Telecommunication Systems Institute, Technical University of Crete, GR

Abstract: The breakdown of Dennard scaling coupled with the persistently growing transistor counts severely increased the importance of application-specific hardware acceleration; such an approach offers significant performance and energy benefits compared to general-purpose solutions. In order to thoroughly evaluate such architectures, the designer should perform an extensive design space exploration so as to evaluate the tradeoffs across the entire system. The design, until recently, has been predominantly done using Register Transfer Level (RTL) languages such as Verilog and VHDL, which, however, lead to a prohibitively long and costly design effort. In order to reduce the design time, a wide range of both commercial and academic High-Level Synthesis (HLS) tools have emerged; most of those tools handle hardware accelerators that are described in synthesizable SystemC. The problem today, however, is that most simulators used for evaluating the complete user applications (i.e. full-system CPU/Mem/Peripheral simulators) lack any type of SystemC accelerator support. Within this context, this paper presents a novel simulation environment comprised of a generic SystemC accelerator and probably the most widely known full-system simulator (i.e. gem5). The proposed system is the only solution supporting the very important feature of global synchronization across the integrated simulation; furthermore, it has been evaluated based on two different computationally intensive use cases and the final results demonstrate that the presented approach is orders of magnitude faster than the existing ones.

Download Paper (PDF; Only available from the DATE venue WiFi)

AUTOMATIC ABSTRACTION OF MULTI-DISCIPLINE ANALOG MODELS FOR EFFICIENT FUNCTIONAL SIMULATION

Speaker: Franco Fummi, Università degli Studi di Verona, IT
Authors: Enrico Fraccaroli1, Michele Lora1 and Franco Fummi2
1University of Verona, IT; 2Università di Verona, IT

Abstract: Multi-discipline components introduce problems when inserted within virtual platforms of Smart Systems for functional validation. This paper lists the most common emerging problems and it proposes a novel approach to solve them. It presents a set of techniques, unified in an automatic abstraction methodology, useful to achieve fast analog mixed-signal simulation even when different physical disciplines and modeling styles are combined into a single analog model. The paper makes use of a complex case study. It deals with multiple-discipline descriptions, non-electrical conservative models, non-linear equation systems, and mixed time/frequency domain models. The original component behavior has been modeled in Verilog-AMS by using electrical, mechanical and kinematic equations. Then, it has been abstracted and integrated within a virtual platform of a mixed-signal smart system for efficient functional simulation.

Download Paper (PDF; Only available from the DATE venue WiFi)

NOVEL MAGNETIC BURN-IN FOR RETENTION TESTING OF STTRAM

Speaker: Swaroop Ghosh, Pennsylvania State University, US
Authors: Mohammad Nasim Imtiaz Khan, Anirudh Iyengar and Swaroop Ghosh, Pennsylvania State University, US

Abstract: Spin-Transfer Torque RAM (STTRAM) is an emerging Non-Volatile Memory (NVM) technology that has drawn significant attention due to complete elimination of bitcell leakage. However, it brings new challenges in characterizing the retention time of the array during test. A significant shift of retention time under static (process variation (PV)) and dynamic (voltage, temperature fluctuation) variability further aggravates this issue. In this paper, we propose a novel magnetic burn-in (MBI) test which can be implemented with minimal changes in the existing test flow to enable STTRAM retention testing at short test time. The magnetic burn-in is also combined with thermal burn-in (MBI+BI) for further compression of retention and test time. Simulation results indicate that MBI with 220 Oe (at 25 °C) can improve the test time by a factor of 3.71×10^13, while MBI+BI with 220 Oe at 125 °C can improve the test time by a factor of 3.71×10^14.

Download Paper (PDF; Only available from the DATE venue WiFi)

AUTOMATIC CONSTRUCTION OF MODELS FOR ANALYTIC SYSTEM-LEVEL DESIGN SPACE EXPLORATION PROBLEMS

Speaker: Seyed-Hosein Attarzadeh-Niazi, Shahid Beheshti University (SBU), IR
Authors: Seyed-Hosein Attarzadeh-Niazi1 and Ingo Sander2
1Shahid Beheshti University (SBU), IR; 2KTH Royal Institute of Technology, SE

Abstract: Due to the variety of application models and target platforms used in embedded electronic system design, it is challenging to formulate a generic and extensible analytic design-space exploration (DSE) framework. Current approaches support a restricted class of application and platform models and are difficult to extend. This paper proposes a framework for automatic construction of system-level DSE problem models based on a coherent, constraint-based representation of system functionality, flexible target platforms, and binding policies. Heterogeneous semantics is captured using constraints on logical clocks. The applicability of this method is demonstrated by constructing DSE problem models from different combinations of application and platform models. Time-triggered and untimed models of the system functionality and heterogeneous target platforms are used for this purpose. Another potential advantage of this approach is that constructed models can be solved using a variety of standard and ad-hoc solvers and search heuristics.

Download Paper (PDF; Only available from the DATE venue WiFi)
UB05.1 NOXIM-XT: A BIT-ACCURATE POWER ESTIMATION SIMULATOR FOR NOCS

**Presenter:** Pierre Bornel, Université de Bretagne Sud, FR

**Authors:** André Ross³, Johann Laurent² and Erwan Morere⁴

³LERIA, Université d’Angers, Angers, France, FR; ²Lab-STICC, Université de Bretagne Sud, Lorient, FR

**Abstract**
We have developed an enhanced version of Noxim (Noxim-XT) to estimate the energy consumption of a NoC in a SoC. Noxim-XT is used in a two-step methodology. First, applications are mapped onto a SoC and their traffic is extracted by simulation with MPSoCBench. Second, Noxim-XT tests various hardware configurations of the NoC; for each configuration, the application's traffic is re-injected and replayed, an accurate performance and power breakdown is provided, and the user can choose different data coding strategies. With the help of Noxim-XT, the energy consumption of each configuration is estimated bit-accurately. After simulation, a spatial mapping of the energy consumption is provided and highlights the hot-spots. Moreover, the new coding strategies allow significant energy savings. Noxim-XT simulations and an FPGA-based prototype of a new coding strategy will be demonstrated at the U-booth to illustrate this work.

*More information...*

UB05.2 RIMEDIO: WHEELCHAIR MOUNTED ROBOTIC ARM DEMONSTRATOR FOR PEOPLE WITH MOTOR SKILLS IMPAIRMENTS

**Presenter:** Alessandro Palla, University of Pisa, IT

**Authors:** Gabriele Meoni and Luca Fanucci, University of Pisa, IT

**Abstract**
People with reduced mobility experience many issues when interacting with indoor and outdoor environments because of their disability. For those users, even the simplest action might be a hard or impossible task to perform without the assistance of an external aid. We propose a simple and lightweight wheelchair-mounted robotic arm, with a focus on a human-machine interface that has to be simple and accessible for users with different kinds of disabilities. The robotic arm is equipped with a 5 MP camera, force and proximity sensors and a 6-axis Inertial Measurement Unit on the end-effector, and it can be controlled using an app running on a tablet. When the user selects the object to reach (for instance a button) on the tablet screen, the arm autonomously carries out the task, using the camera image and the sensor measurements for autonomous navigation. The demonstrator consists of the robotic arm prototype, the Android tablet and a personal computer for arm setup and configuration.

*More information...*

UB05.3 NNDNN: NEURAL NETWORKS DESIGNING NEURAL NETWORKS

**Presenter:** Brett Meyer, McGill University, CA

**Authors:** Warren Gross, Sean Smithson, Ossama Ahmed and Guang Yang, McGill University, CA

**Abstract**
Modern artificial neural networks currently achieve state-of-the-art results in various difficult problems, including image classification and speech recognition. However, both the performance and computational complexity of such models are heavily dependent on the design of characteristic hyper-parameters (e.g., numbers of hidden layers or nodes per layer) which are often manually optimized. With neural networks penetrating low-power mobile and embedded areas, the need now arises to optimize not only for performance, but also for implementation cost. In our work, we present a multi-objective design space exploration method leveraging machine learning based response surface modelling to reduce the number of solutions trained and evaluated. Experimental results are presented for several image recognition datasets, demonstrating the evolution of the approximated Pareto-optimal hyper-parameters and corresponding GPU code; all while exploring only a small fraction of the design space.

*More information...*

UB05.4 MATISSE: A TARGET-AWARE COMPILER TO TRANSLATE MATLAB INTO C AND OPENCL

**Presenter:** Luís Reis, University of Porto, PT

**Authors:** João Bispo and João Cardoso, University of Porto / INESC-TEC, PT

**Abstract**
Many engineering, scientific and finance algorithms are prototyped and validated in array languages, such as MATLAB, before being converted to other languages such as C for use in production. As such, there has been substantial effort to develop compilers to perform this translation automatically. Alternative types of computation devices, such as GPUs and FPGAs, are becoming increasingly more popular, so it becomes critical to develop compilers that target these architectures. We have adapted MATISSE, our MATLAB-compatible compiler framework, to generate C and OpenCL code for these platforms. In this demonstration, we will show how our compiler works and what its capabilities are. We will also describe the main challenges of efficient code generation from MATLAB and how to overcome them.

*More information...*

UB05.5 SCCHARTS: SYNCHRONOUS STATECHARTS FOR SAFETY-CRITICAL APPLICATIONS

**Presenter:** Reinhard von Hanxleden, Kiel University, DE

**Authors:** Michael Medler³, Christian Motika², Christoph Daniel Schulze² and Steven Smyth²

³Bamberg University, DE; ²Kiel University, DE

**Abstract**
We present a visual language, SCCharts, designed for specifying safety-critical reactive systems. SCCharts use a statechart notation and provide determinate concurrency based on a synchronous model of computation (MoC), without restrictions common to previous synchronous MoCs. Specifically, we lift earlier limitations on sequential accesses to shared variables by leveraging the sequentially constructive MoC. For further details, see [von Hanxleden et al., PLDI’14](http://www.sccharts.com) and http://www.sccharts.com. The SCCharts demonstrator is an Eclipse Rich Client and part of KIELER (http://www.rtsys.informatik.uni-kiel.de/en/research/kieler). The demonstration shows how to write an SCChart model using a textual notation, from which a visual model is generated on the fly using the Eclipse Layout Kernel (ELK). We also present a compilation chain that allows efficient synthesis of software and hardware.

*More information...*

UB05.6 MULTI-CORE VERIFICATION: COMBINING MICROTESK AND SPIN FOR VERIFICATION OF MULTI-CORE MICROPROCESSORS

**Presenter:** Mikhail Chupikko, ISPRAS, RU

**Authors:** Alexander Kamkin, Mikhail Lebedev and Andrei Tatarnikov, ISPRAS, RU

**Abstract**
The complexity of modern cache coherence protocols (CCP) in multi-core microprocessors prevents complete verification of shared memory subsystems by means of random test-program generators (TPG). The following steps are suggested to address the problem. The first step is to separately specify CCP features and generate CCP-specific events to be used in the TPG when generating a test program (TP). The protocol is specified in Promela, with Spin producing a test template (TT). Spin also produces a UVM (or C++TESK) testbench to make the execution of the resulting TPs controllable and deterministic. The second step is to let the TPG produce the memory access instructions causing the desired CCP-specific behavior. As a TPG we use MicroTESK. Its Ruby-based TTs abstractly describe future TPs. MicroTESK processes the TT, producing a TP with CCP-specific events. The resulting TP is executed together with the testbench to exactly reproduce the situation Spin had found to be important for such a protocol.

*More information...*
The challenging aspects of IoT enabling technologies include the lowest cost, power dissipation, dependability, security, and the ability to integrate heterogeneous devices and technologies. This session presents research-oriented perspectives on overcoming these challenges.

**6.1 IoT Day Hot Topic Session: IoT Enabling Technologies**

**Date:** Wednesday 29 March 2017  
**Time:** 11:00 - 12:30  
**Location / Room:** 5BC

**Organisers:**  
Marilyn Wolf, Georgia Tech, US  
Andreas Herkersdorf, TU Muenchen, DE

**Chair:**  
Andreas Herkersdorf, TU Muenchen, DE

**Co-Chair:**  
Marilyn Wolf, Georgia Tech, US

The introduction and broad scale rollout of IoT applications put pressing demands on semiconductor base technologies for computation, communication and sensing in terms of lowest cost, power dissipation, dependability, security and the ability to integrate heterogeneous devices and technologies. This session presents three research-oriented perspectives on the challenging aspects of IoT enabling technologies.
<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>11:00</td>
<td>6.1.1</td>
<td>ULTRA-LOW-POWER CIRCUITS FOR IOT APPLICATIONS</td>
<td>Georges Gielen, Katholieke Universiteit Leuven, BE</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td>IoT applications require ultra-low-power hardware solutions that communicate wirelessly. Challenges and some solutions in designing these will be highlighted.</td>
</tr>
<tr>
<td>11:30</td>
<td>6.1.2</td>
<td>STRUCTURAL HEALTH MONITORING FOR SMART CITIES: A HW/SW CODESIGN PERSPECTIVE</td>
<td>Jiang Xu, Hong Kong University of Science and Technology, HK</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td>The structural integrity of civil structures is vital to economic prosperity and public safety. In developed countries and regions, a large number of transportation and residential infrastructures are aging rapidly. There is an urgent need and rapidly increasing demand for the ability to monitor the health conditions of civil structures in a real-time and distributed manner. This talk will share our experiences on developing large scale structural health monitoring systems from a HW/SW codesign perspective</td>
</tr>
<tr>
<td>12:00</td>
<td>6.1.3</td>
<td>SECURITY IN THE INTERNET OF THINGS: A CHALLENGE OF SCALE</td>
<td>Patrick Schaumont, Virginia Tech, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speaker and Author:</td>
<td>Patrick Schaumont, Virginia Tech, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td>Technological scaling has offered a windfall of benefits to electronics design. Increased transistor density has offered an exponential increase in computing capabilities over time, but without a corresponding increase in system cost. Information security has its own success story with scaling. Cryptographic algorithms become exponentially harder to break through a mere linear increase in encryption complexity or in key-length. In the Internet of Things, scaling is as much a security liability as it is an advantage. These security liabilities are new, poorly understood and poorly regulated. Some examples include the following: privacy of IoT data in the cloud; the safety consequences of poor information security in cyber-physical systems; the liabilities of long-lifetime devices that use outdated or poorly tested information security; the performance-limited information security in devices that run on the outskirts of the IoT using nothing but harvested energy. In this contribution we consider the security landscape for IoT. We consider the technological consequences of securely extending the Internet into the physical world of things. We identify current limitations, ongoing research efforts, and open challenges for the design community.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
</tbody>
</table>


Date: Wednesday 29 March 2017
Time: 11:00 - 12:30
Location / Room: 4BC
Chair: Jamil Kawa, Synopsys, US

In this executive session, we will discuss the prominent features and requirements of today’s autonomously powered systems and deliberate over various visions of what needs to happen next to take autonomously powered systems from their embryonic state to an advanced, efficient state that is well thought through and efficiently architected.

Moderator:
- Jamil Kawa, Synopsys, US

Panelists:
- Mario Konijnenburg, IMEC, BE
- Christoph Heer, Intel, DE
- Yankin Tanurhan, Synopsys, US
- Ali Keshavarzi, Cypress Semiconductor, US

12:30 End of session
Lunch Break in Garden Foyer
Keynote Lecture session 7.0 in “Garden Foyer” 13:50 - 14:20

Lunch Break in the Garden Foyer
On all conference days (Tuesday to Thursday), a buffet lunch will be offered in the Garden Foyer, in front of the session rooms. Kindly note that this is restricted to conference delegates who hold a lunch voucher. When entering the lunch break area, delegates will be asked to present the lunch voucher for that day. Once delegates have left the lunch area, re-entry is not allowed for that lunch.

6.3 Security Primitives

Date: Wednesday 29 March 2017
Time: 11:00 - 12:30
Location / Room: 2BC
Chair: Berndt Gammel, Infineon Technologies, DE
Co-Chair: Tim Güneysu, University of Bremen & DFKI, DE

This session discusses the implementation of basic primitives that are necessary building blocks for secure systems: physical unclonable functions (PUFs) are used for creating secret values, which are then used as keys in cryptographic algorithms. The logical and physical security of these systems fundamentally relies on the presence of high-quality random numbers.
11:00 6.3.1 SENSITIZED PATH PUF: A LIGHTWEIGHT EMBEDDED PHYSICAL UNCLONABLE FUNCTION
Speaker: Matthias Sauer, University of Freiburg, DE
Authors: Matthias Sauer, Pascal Raiola, Linus Feiten, Bernd Becker, Ulrich Rührmair and Ilia Polian
1University of Freiburg, DE; 2TU München, DE; 3University of Passau, DE
Abstract
Physical unclonable functions (PUFs) can be used for a number of security applications, including secure on-chip generation of secret keys. We introduce an embedded PUF concept called sensitized path PUF (SP-PUF) that is based on extracting entropy out of inherent timing variability of modules already present in the circuit. The new PUF sensitizes paths of nearly identical lengths and generates response bits by racing transitions through different paths against each other. SP-PUF has lower area overhead and higher speed than earlier embedded PUFs and requires no helper data stored in non-volatile memory beyond standard error-correction information for fuzzy extraction. Compared with standalone PUFs, the new solution intrinsically and inseparably intertwines PUF behavior with functional circuitry, thus complicating invasive attacks or simplifying their detection. Moreover, SP-PUF can naturally define the contribution of a digital block to a system-wide "fusion PUF." We present a systematic design flow to turn an arbitrary (sufficiently complex) circuit into an SP-PUF. The flow leverages state-of-the-art sensitization algorithms, formal filtering based on statistical analysis, and MAXSAT-based optimization of SP-PUF's area overhead. Experiments show that SP-PUF extracts 256-bit keys with perfect reliability and nearly perfect uniqueness after fuzzy extraction for the majority of standard benchmarks circuits.
Download Paper (PDF; Only available from the DATE venue WiFi)

11:30 6.3.2 TEMPERATURE AWARE PHASE/FREQUENCY DETECTOR-BASED RO-PUFS EXPLOITING BULK-CONTROLLED OSCILLATORS
Speaker: Sha Tao, Royal Institute of Technology (KTH), SE
Authors: Sha Tao and Elena Dubrova, Royal Institute of Technology (KTH), SE
Abstract
Physical unclonable functions (PUFs) are promising hardware security primitives suitable for low-cost cryptographic applications. Ring oscillator (RO) PUF is a well-received silicon PUF solution due to its ease of implementation and entropy evaluation. However, the responses of RO-PUFs are susceptible to environmental changes, in particular, to temperature variations. Additionally, a conventional RO-PUF implementation is usually more power-hungry than other PUF alternatives. This paper explores circuit-level techniques to design low-power RO-PUFs with enhanced thermal stability. We introduce a power-efficient approach based on a phase/frequency detector (PFD) to perform pairwise comparisons of ROs. We also propose a temperature compensated bulk-controlled oscillator (BCO) and investigate its feasibility and usage in PFD-based RO-PUFs. Evaluation results demonstrate that the proposed techniques can effectively reduce the thermally induced errors in PUF responses while imposing a low power overhead. The PFD-based BCO-PUF is one of the best among existing RO-PUFs in terms of power efficiency.
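
The pairwise comparison at the heart of an RO-PUF can be sketched in software as follows. This models only the comparison logic, not the PFD or BCO circuits or their temperature behavior; the Gaussian frequency model, the nominal frequency, the spread, and all function names are assumptions chosen for illustration.

```python
import random


def make_chip(n_ros, seed, nominal_hz=200e6, sigma_hz=1e6):
    """Draw one fixed frequency per ring oscillator to mimic process variation."""
    rng = random.Random(seed)
    return [rng.gauss(nominal_hz, sigma_hz) for _ in range(n_ros)]


def puf_response(freqs, noise_hz=0.0, rng=None):
    """Compare disjoint neighboring RO pairs; each pair yields one response bit.

    noise_hz models measurement or environmental noise added per readout.
    """
    rng = rng or random.Random(0)
    measured = [f + rng.gauss(0.0, noise_hz) for f in freqs]
    return [int(measured[i] > measured[i + 1]) for i in range(0, len(measured) - 1, 2)]


def hamming_distance(r1, r2):
    """Number of differing bits between two responses."""
    return sum(x != y for x, y in zip(r1, r2))


if __name__ == "__main__":
    chip_a = make_chip(64, seed=1)
    chip_b = make_chip(64, seed=2)
    ref = puf_response(chip_a)
    noisy = puf_response(chip_a, noise_hz=2e5, rng=random.Random(42))
    print("bit errors on the same chip under noise:", hamming_distance(ref, noisy))
    print("distance between two chips:", hamming_distance(ref, puf_response(chip_b)))
```

Bit errors in this model come from pairs whose frequencies are close together, which is where noise or temperature drift can flip the comparison result.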
Download Paper (PDF; Only available from the DATE venue WiFi)

11:40 6.3.3 CHACHA20-POLY1305 AUTHENTICATED ENCRYPTION FOR HIGH-SPEED EMBEDDED IOT APPLICATIONS
Speaker: Fabrizio De Santis, Technische Universität München, DE
Authors: Fabrizio De Santis, Andreas Schauer and Georg Sigl, Technische Universität München, DE
Abstract
The ChaCha20 stream cipher and the Poly1305 authenticator are cryptographic algorithms designed by Daniel J. Bernstein with the aim of ensuring high [...]
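
For readers unfamiliar with ChaCha20, the block below shows its basic building block, the quarter round, as publicly specified in RFC 8439 (Section 2.1). It says nothing about the implementation presented in this talk; the helper names are chosen for this example, and the input values in the usage part are the test vector from RFC 8439, Section 2.1.1.

```python
MASK32 = 0xFFFFFFFF


def rotl32(value, shift):
    """Rotate a 32-bit word left by `shift` bits."""
    return ((value << shift) & MASK32) | (value >> (32 - shift))


def quarter_round(a, b, c, d):
    """ChaCha quarter round on four 32-bit words (add, XOR, rotate)."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d


if __name__ == "__main__":
    out = quarter_round(0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567)
    print(" ".join(f"{word:08x}" for word in out))
```

Only additions, XORs and fixed rotations are used, so the operation has no data-dependent branches or table lookups, which is one reason the cipher is popular on small microcontrollers.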
Download Paper (PDF; Only available from the DATE venue WiFi)

12:00 6.3.4 TOWARDS POST-QUANTUM SECURITY FOR IOT ENDPOINTS WITH NTRU
Speaker: Johanna Sepulveda, TU Munich, DE
Authors: Oscar M. Guillen, Thomas Pöppelmann, Jose M. Bermudo Mera, Georg Sigl and Johanna Sepulveda
¹TU München, DE; ²Infineon Technologies, DE; ³Radboud University, NL
Abstract
The NTRUEncrypt cryptosystem is one of the main alternatives for practical implementations of post-quantum, public-key cryptography. In this work, we analyze the feasibility of employing the NTRU encryption scheme, NTRUEncrypt, in resource-constrained devices such as those used for Internet-of-Things endpoints. We present an analysis of NTRUEncrypt's advantages over other cryptosystems for use in such devices. We describe four different NTRUEncrypt implementations on an ARM Cortex M0-based microcontroller, compare their results, and show that NTRUEncrypt is suitable for use in battery-operated devices. We present performance and memory footprint figures for different security parameters, as well as energy consumption in a resource-constrained microcontroller, to back up these claims. Furthermore, to the best of our knowledge, in this work we present the first time-independent implementation of NTRUEncrypt.
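The dominant operation in NTRUEncrypt is multiplication in the truncated polynomial ring Z_q[x]/(x^N - 1), that is, a cyclic convolution of coefficient vectors. The sketch below shows that operation with toy parameters; the function name `ring_mul` is hypothetical and this is not one of the four implementations described in the abstract. The loop structure does not depend on coefficient values, which is the general idea behind time-independent code, but Python itself gives no timing guarantees.

```python
def ring_mul(a, b, q):
    """Multiply two polynomials in Z_q[x]/(x^N - 1).

    a and b are coefficient lists of equal length N, lowest degree first.
    Every (i, j) pair is visited regardless of the coefficient values.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("operands must have the same length N")
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = (i + j) % n          # x^N wraps around to x^0
            c[k] = (c[k] + a[i] * b[j]) % q
    return c
```

For example, with N = 3 the product (1 + x)(x + x^2) reduces to 1 + x + 2x^2, because x^3 wraps around to 1.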
Download Paper (PDF; Only available from the DATE venue WiFi)

12:30 LEVERAGING AGING EFFECT TO IMPROVE SRAM-BASED TRUE RANDOM NUMBER GENERATORS
Speaker: Mohammad Saber Golanbari, Karlsruhe Institute of Technology (KIT), DE
Authors: Saman Kiani¹, Mohammad Saber Golanbari² and Mehdi Tahoori
¹Karlsruhe Institute of Technology (KIT), DE; ²Karlsruhe Institute of Technology, DE
Abstract
The start-up value of SRAM cells can be used as a random number vector or as a seed for the generation of a pseudo-random number. However, the randomness of the generated number is low, since many of the cells are heavily skewed due to process variation and always start up at zero or at one. In this paper, we propose an approach to increase the randomness of SRAM-based True Random Number Generators (TRNGs) by leveraging the impact of transistor aging. The idea is to iteratively power up the SRAM cells and put them under accelerated aging to make the cells less skewed and hence obtain a more random vector. The simulation results show that the min-entropy of an SRAM-based TRNG increases by 10X using this approach.
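The effect of cell skew on min-entropy can be made concrete with the standard definition for a binary source, H_min = -log2(max(p, 1 - p)), where p is the probability that a cell powers up as 1. The sketch below is an illustration only: the function names are hypothetical, and summing over cells assumes the cells are independent, which the abstract does not state.

```python
import math

def cell_min_entropy(p_one):
    """Min-entropy in bits of one SRAM cell that powers up as 1 with probability p_one."""
    return -math.log2(max(p_one, 1.0 - p_one))

def array_min_entropy(p_ones):
    """Total min-entropy of an array, assuming independent cells (an assumption)."""
    return sum(cell_min_entropy(p) for p in p_ones)
```

A fully skewed cell (p of 0 or 1) contributes nothing, while a perfectly balanced cell contributes one full bit, so reducing skew directly raises the min-entropy of the array.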
Download Paper (PDF; Only available from the DATE venue WiFi)
OPERAND SIZE RECONFIGURATION FOR BIG DATA PROCESSING IN MEMORY

Speaker:
Luigi Carro, UFRGS, BR
Authors:
Paulo Cesar Santos¹, Geraldo Francisco de Oliveira Junior², Diego Gomes Tomé³, Marco Antonio Zanata Alves³, Eduardo Cunha de Almeida³ and Luigi Carro⁴
¹UFRGS - Universidade Federal do Rio Grande do Sul, BR; ²Universidade Federal do Rio Grande do Sul, BR; ³UFRPR, BR; ⁴UFRGS, BR

Abstract
Nowadays, applications that predominantly perform lookups over large databases are becoming more popular, with column-stores as the database system architecture of choice. For these applications, Hybrid Memory Cubes (HMCs) can provide bandwidths of up to 320 GB/s and represent the best choice to sustain the throughput for these ever-increasing databases. However, even with the high available memory bandwidth and processing power, the data movements through the memory hierarchy needed to achieve peak performance consume an unnecessary amount of time and energy. In order to accelerate database operations and reduce the energy consumption of the system, this paper presents the Reconfigurable Vector Unit (RVU), which enables massive and adaptive in-memory processing, extending the native HMC instructions and also increasing their effectiveness. RVU enables the programmer to reconfigure it to perform as one large vector unit or as multiple small vector units, to better adjust to the application's needs during different computation phases. Due to its adaptability, RVU achieves an average performance increase of 27x and reduces DRAM energy consumption by 29% when compared to an x86 processor with 16 cores. Compared with the state-of-the-art mechanism capable of performing large vector operations with fixed size inside the HMC, RVU performs up to 12% better and improves energy consumption by 53%.

Download Paper (PDF; Only available from the DATE venue WiFi)

6.5 Hot Topic Session: Memristor for Computing: Myth or Reality?

Date: Wednesday 29 March 2017

Time: 11:00 - 12:30

Location / Room: 3C

Organisers:

Koen Bertels, Delft University of Technology, NL

Said Hamdioui, Delft University of Technology, NL

Chair:

Akash Kumar, Technische Universitaet Dresden, DE

Co-Chair:

Koen Bertels, Delft University of Technology, NL

Both today’s technology and computer architectures are facing serious challenges (walls) that make them incapable of delivering the right computing power within pre-defined constraints for emerging applications such as big data. However, a solution may be at your fingertips. This session discusses the emerging memristor device as an enabler of new memory technologies and new logic design styles, as well as its potential to enable new computing paradigms such as memory-intensive architectures and neuromorphic computing, thanks to unique properties such as tight integration with CMOS and the ability to learn and adapt.

Time | Label | Presentation Title | Authors
---|---|---|---
11:00 | 6.5.1 | MEMRISTOR: WHAT IS IT ABOUT AND WHAT IS ITS POTENTIAL? | Said Hamdioui, Delft University of Technology, NL
11:30 | 6.5.2 | MEMRISTOR FOR MEMORY-INTENSIVE ARCHITECTURES | Shahar Kvatinsky, Technion/Israel Institute of Technology, IL

6.6 Industrial Experiences & EU Projects

Date: Wednesday 29 March 2017
Location / Room: SA
Chair: Eugenio Villar, University of Cantabria, ES

This session addresses industrial research and practice on architecture, design, timing analysis techniques and analogue circuit sizing. The session will be rounded off by presentations of two European projects about to start, addressing cross-layer design of reconfigurable CPS and IoT for smart wearable applications.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
</tr>
</thead>
<tbody>
<tr>
<td>11:00</td>
<td>6.6.1</td>
<td>AN ASYNCHRONOUS NOC ROUTER IN A 14NM FINFET LIBRARY: COMPARISON TO AN INDUSTRIAL SYNCHRONOUS COUNTERPART</td>
</tr>
<tr>
<td>11:15</td>
<td>6.6.2</td>
<td>AN ADVANCED EMBEDDED ARCHITECTURE FOR CONNECTED COMPONENT ANALYSIS IN INDUSTRIAL APPLICATIONS</td>
</tr>
<tr>
<td>11:30</td>
<td>6.6.3</td>
<td>WORKLOAD DEPENDENT RELIABILITY TIMING ANALYSIS FLOW</td>
</tr>
</tbody>
</table>

---

**AN ASYNCHRONOUS NOC ROUTER IN A 14NM FINFET LIBRARY: COMPARISON TO AN INDUSTRIAL SYNCHRONOUS COUNTERPART**

Speaker: Wayne Burleson, Advanced Micro Devices, Inc., US
Authors: Weiwei Jiang¹, Davide Bertozzi², Gabriele Miorandi², Steven M. Nowick¹, Wayne Burleson³ and Greg Sadowski³
¹Columbia University, US; ²University of Ferrara, IT; ³Advanced Micro Devices, US

Abstract
An asynchronous high-performance low-power 5-port network-on-chip (NoC) router is introduced. The proposed router integrates low-latency input buffers using a circular FIFO design, and a novel end-to-end credit-based virtual channel (VC) flow control for a replicated switch architecture. This asynchronous router is then compared to an AMD synchronous router, in a realistic advanced 14nm FinFET library. This is the first such comparison, to the best of our knowledge, using a real synchronous router baseline already fabricated in several commercial products. Initial post-synthesis pre-layout experiments show dominating results for the asynchronous router, when compared to the synchronous router. In particular, 55% less area and 28% latency improvement are observed for the asynchronous implementation. Also, 88% and 58% savings in idle and active power, respectively, are obtained.

Download Paper (PDF; Only available from the DATE venue WiFi)

**AN ADVANCED EMBEDDED ARCHITECTURE FOR CONNECTED COMPONENT ANALYSIS IN INDUSTRIAL APPLICATIONS**

Speaker: Menbere Tekleyohannes, University of Kaiserslautern, DE
Authors: Menbere Tekleyohannes¹, MohammadSadegh Sadrí¹, Martin Klein², Michael Siegrist², Christian Weis¹ and Norbert Wehn¹
¹University of Kaiserslautern, DE; ²Wipotec GmbH, DE

Abstract
In recent years, connected component analysis (CCA) has become one of the vital image/video processing algorithms due to its wide-ranging applicability in the field of computer vision. Numerous applications such as pattern recognition, object detection and image segmentation involve connected component analysis. In the context of camera-based inspection systems, CCA plays an important role in quality assurance. State-of-the-art hardware architectures offer high-performance implementations of CCA using field programmable gate arrays (FPGAs). However, due to their high memory demand, most of these implementations exhibit a large resource utilization. In this paper, we propose a hybrid software-hardware architecture of CCA for an industrial application using the Xilinx Zynq-7000 All Programmable System on Chip (SoC). By offloading the most resource-consuming part of the algorithm to the embedded CPU, we achieved high performance while reducing the required resources on the FPGA. Our proposed architecture saves more than 30% of on-chip memory (Block RAMs) compared to state-of-the-art hardware architectures without affecting the throughput. Furthermore, due to the embedded CPU, our system provides versatile and highly flexible feature extraction at run-time without the need to reconfigure the FPGA.
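As background on what CCA computes, the sketch below shows the classical two-pass labelling algorithm with a union-find table of label equivalences, written purely in software. It assumes a binary image given as a list of rows and 4-connectivity; the function name `label_components` is hypothetical. It does not reproduce the paper's hardware/software partitioning and only illustrates the label-equivalence bookkeeping that such architectures have to store.

```python
def label_components(image):
    """Label 4-connected foreground (non-zero) pixels of a 2D list.

    Returns (labels, count), with labels numbered 1..count and 0 for background.
    """
    rows, cols = len(image), len(image[0])
    parent = [0]  # parent[i] is the union-find parent of provisional label i

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    labels = [[0] * cols for _ in range(rows)]
    # First pass: assign provisional labels and record equivalences.
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = up or left
                if up and left:
                    root_up, root_left = find(up), find(left)
                    if root_up != root_left:
                        parent[max(root_up, root_left)] = min(root_up, root_left)
    # Second pass: replace provisional labels by compact final labels.
    remap = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = remap.setdefault(root, len(remap) + 1)
    return labels, len(remap)
```

A U-shaped region receives two provisional labels in the first pass that are merged when the two arms meet, which is why the equivalence table, rather than the image itself, tends to drive memory demand.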

Download Paper (PDF; Only available from the DATE venue WiFi)

**WORKLOAD DEPENDENT RELIABILITY TIMING ANALYSIS FLOW**

Speaker: Ajith Sivadasan, TIMA Labs, FR
Authors: Ajith Sivadasan¹, Armelle Notin¹, Vincent Huard¹, Etienne Maurin¹, Florian Cacho², Sidi Ahmed Benhassain³ and Lorena Anghel³
¹TIMA Labs, FR; ²STMicroelectronics, FR; ³TIMA, FR; ⁴Grenoble-Alpes University, FR

Abstract
Silicon measurements indicate that the frequency-limiting paths change with aging and as a function of workload. This paper proposes a simulation flow that leads to the identification of such paths. Gate-level models take into consideration the stress experienced by the corresponding standard cells for a given workload on the digital circuit, thereby providing a more accurate estimate of the aging of the critical paths and of the circuit as a whole.

Download Paper (PDF; Only available from the DATE venue WiFi)
The session provides an overview of recent advances in model-based design of embedded real-time systems. The first paper proposes an optimal deployment for data-flow control systems in the context of event-based real-time simulation. This session is chaired by Alain Girault from INRIA, FR.

11:45 6.6.4 PROBABILISTIC TIMING ANALYSIS ON TIME-RANDOMIZED PLATFORMS FOR THE SPACE DOMAIN

#### Abstract
Timing verification is a fundamental step in real-time embedded systems, with measurement-based timing analysis (MBTA) being the most common approach used to that end. We present a space case study on a real platform that has been modified to support a probabilistic variant of MBTA called MBPTA. Our platform provides the properties required by MBPTA, and the WCET estimates predicted with MBPTA are competitive with those of current MBTA practice while providing more solid evidence of their correctness for certification.

Download Paper (PDF; Only available from the DATE venue WiFi)

12:00 6.6.5 CROSS-LAYER DESIGN OF RECONFIGURABLE CYBER-PHYSICAL SYSTEMS

#### Abstract
In the last few years, besides the concepts of embedded and interconnected systems, the notion of Cyber-Physical Systems (CPS) has also emerged: embedded computational collaborating devices, capable of sensing and controlling physical elements and, often, responding to humans. The continuous interaction between the physical and the computing layers makes their design and maintenance extremely complex. Uncertainty management and runtime reconfigurability, to mention the most relevant issues, are rarely tackled by available commercial and academic toolchains. In this context, the Cross-layer model-based framework for multi-objective design of Reconfigurable systems in uncertain environments (CERBERO) EU project aims at developing a design environment for CPS based on two pillars: 1) a cross-layer model-based approach to describe, optimize, and analyze the system and all its different views concurrently and 2) an advanced adaptivity support based on a multi-layer autonomous engine. In this work, we describe the necessary components and the required developments for seamless design of reusable and reconfigurable CPS and Systems of Systems in uncertain hybrid environments.

Download Paper (PDF; Only available from the DATE venue WiFi)

12:15 6.6.6 INSPEX: DESIGN AND INTEGRATION OF A PORTABLE/WEARABLE SMART SPATIAL EXPLORATION SYSTEM

#### Abstract
The main objective of the INSPEX H2020 project is to integrate automotive-equivalent spatial exploration and obstacle detection functionalities into a portable/wearable, multi-sensor, miniaturised, low-power device. The INSPEX system will detect and localise static and mobile obstacles in real time, in 3D, under various environmental conditions. Potential applications range from safer human navigation in reduced visibility and obstacle avoidance systems for small robots and drones to navigation for the visually or mobility impaired, the latter being the primary use case considered in the project.

Download Paper (PDF; Only available from the DATE venue WiFi)

12:30 End of session

#### Lunch Break
Lunch Break in the Garden Foyer

Lunch Keynote session 7.0 in "Garden Foyer" 13:50 - 14:20

On all conference days (Tuesday to Thursday), a buffet lunch will be offered in the Garden Foyer, in front of the session rooms. Kindly note that this is restricted to conference delegates possessing a lunch voucher. When entering the lunch break area, delegates will be asked to present the lunch voucher for that day. Once a delegate has left the lunch area, re-entry is not allowed for that lunch.

11:00 6.7.1 NEAR-OPTIMAL DEPLOYMENT OF DATAFLOW APPLICATIONS ON MANY-CORE PLATFORMS WITH REAL-TIME GUARANTEES
Speaker: Stefanos Skalitsis, École Polytechnique Fédérale de Lausanne (EPFL), CH
Authors: Stefanos Skalitsis and Alena Simalatar, EPFL, CH
Abstract: Safe and optimal deployment of data-streaming applications on many-core platforms requires a realistic estimation of the task Worst-Case Execution Time (WCET). On the other hand, task WCET depends on the deployment solution, due to the varying number of interferences on shared resources, thus introducing a cyclic dependency. Moreover, WCET is still an over-approximation of the Actual Execution Time (AET), thus leaving room for run-time optimisation. In this paper we introduce an offline/online optimisation approach. In the offline phase, we first break the cyclic dependency and acquire safe and near-optimal solutions for task partitioning/placement, mapping, scheduling and buffer allocation. Then, we tighten the WCETs and update the scheduling function accordingly. In the online phase we introduce a safe distributed readjustment of the offline schedule, based on the AET. Experiments on a Kalray MPPA-256 platform show a tightening of the guaranteed latency of up to 46% in the offline phase, and a 41% latency reduction in the online phase. In total, we achieve a latency reduction of more than 50%.
Download Paper (PDF; Only available from the DATE venue WiFi)

11:30 6.7.2 SIMULATING PREEMPTIVE SCHEDULING WITH TIMING-AWARE BLOCKS IN SIMULINK
Speaker and Author: Andreas Naderlinger, University of Salzburg, AT
Abstract: This paper introduces an extension of the modeling and simulation environment MATLAB/Simulink. It enables control and system engineers to consider software execution times, as well as the effects of scheduling and preemption inside software-in-the-loop (SIL) simulations. To this end, we present the concept of a Simulink block whose execution lasts for a finite amount of simulation time. During this time, the simulation engine continues to update the plant or other blocks with outputs that have already been calculated by the block. Execution time information is assumed to be known (or based on some random distribution). Source-level annotating the control software with target specific timing information enables a fine-grained and even a control-flow dependent simulation of the block. We outline the required synchronization with the simulation engine of Simulink. This timing-aware block consumes simulation time in the same sense as a task consumes CPU time on a target. We describe a mechanism to execute a set of such blocks with (potentially cyclic) data dependencies with a static priority scheduler inside Simulink, including support for preemption. The presented approach permits a development process, where a typical time invariant and platform agnostic model is incrementally transformed into a platform-specific one that makes the simulation more realistic.
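The behaviour that the timing-aware blocks are meant to reproduce, tasks consuming time under a static-priority scheduler with preemption, can be illustrated outside Simulink with a unit-tick simulation. The sketch below is an assumption-laden toy: the function `simulate`, the task tuple layout, and the one-tick time resolution are all hypothetical, and nothing here uses or mimics the Simulink API.

```python
def simulate(tasks, horizon):
    """Simulate static-priority preemptive scheduling in unit time steps.

    tasks maps a task name to (priority, release_time, exec_time); a lower
    priority number means a higher priority. Returns each task's finish time.
    """
    remaining = {name: spec[2] for name, spec in tasks.items()}
    finish = {}
    for now in range(horizon):
        ready = [name for name, (_, release, _) in tasks.items()
                 if release <= now and remaining[name] > 0]
        if not ready:
            continue
        # The highest-priority ready task runs; others are preempted or wait.
        running = min(ready, key=lambda name: tasks[name][0])
        remaining[running] -= 1
        if remaining[running] == 0:
            finish[running] = now + 1
    return finish
```

With a low-priority task released at time 0 for 4 ticks and a high-priority task released at time 1 for 2 ticks, the low-priority task is preempted after one tick and only finishes at time 6, two ticks later than it would in a simulation that ignores scheduling.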
Download Paper (PDF; Only available from the DATE venue WiFi)

12:00 6.7.3 ONLINE WORKLOAD MONITORING WITH THE FEEDBACK OF ACTUAL EXECUTION TIME FOR REAL-TIME SYSTEMS
Speaker: Biao Hu, Tech. Univ. Muenchen TUM, DE
Authors: Biao Hu¹, Kai Huang², Gang Chen¹, Long Cheng¹ and Alois Knoll¹
¹Tech. Univ. Muenchen TUM, DE; ²Sun Yat-Sen University, CN
Abstract: Guaranteeing that the system workload stays within design bounds is a basic requirement for a real-time system. Design-time bounds are usually based on worst-case activation patterns and worst-case execution time. While using the worst-case assumptions for online monitoring can guarantee system safety, it also leaves slack unexploited, because tasks consume less than their worst-case execution times. In this paper, we introduce a monitoring scheme with feedback of the actual execution time for real-time systems. By using this runtime feedback instead of offline assumptions, this monitoring scheme can accept events that are considered violations offline, and thereby improve the system utilization. In experiments with both MATLAB simulation and MicroC/OS-II running on a softcore processor implemented on an FPGA, different probability distributions of the actual execution time are used to analyze how much benefit can be gained from the feedback scheme.
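The benefit of feeding actual execution times back into a monitor can be illustrated with a deliberately simple model. The sketch below is not the scheme from the paper: it is a hypothetical sliding-window budget monitor (the class `BudgetMonitor` and all its parameters are assumptions) that charges each admitted event its WCET up front and lowers the charge once the actual execution time is known.

```python
class BudgetMonitor:
    """Admit events while the execution time charged within a window fits a budget."""

    def __init__(self, budget, window, wcet):
        self.budget = budget    # maximum charged time allowed per window
        self.window = window    # window length in time units
        self.wcet = wcet        # worst-case execution time charged on admission
        self.charges = []       # [arrival_time, charged_time] entries

    def _used(self, now):
        # Drop entries that have left the window, then sum what remains.
        self.charges = [c for c in self.charges if c[0] > now - self.window]
        return sum(c[1] for c in self.charges)

    def admit(self, now):
        """Return a charge entry if the event fits the budget, else None."""
        if self._used(now) + self.wcet > self.budget:
            return None
        entry = [now, self.wcet]
        self.charges.append(entry)
        return entry

    def complete(self, entry, actual):
        """Feedback step: replace the WCET charge by the actual execution time."""
        entry[1] = min(actual, self.wcet)
```

With a budget of 10 and a WCET of 4, a third event is rejected while two earlier events are still charged at WCET, but is admitted once their actual execution times of 1 and 2 have been reported.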
Download Paper (PDF; Only available from the DATE venue WiFi)

12:30 6.7.4 LATENCY ANALYSIS OF HOMOGENEOUS SYNCHRONOUS DATAFLOW GRAPHS USING TIMED AUTOMATA
Speaker: Guus Kuiper, University of Twente, NL
Authors: Guus Kuiper¹ and Marco Bekooij²
¹University of Twente, NL; ²University of Twente + NXP semiconductors, NL
Abstract: There are several analysis models and corresponding temporal analysis techniques for checking whether applications executed on multiprocessor systems meet their real-time constraints. However, currently there does not exist an exact end-to-end latency analysis technique for Homogeneous Synchronous Dataflow (HSDF) with Auto-concurrency (HSDFa) models that takes the correlation between the firing durations of different firings into account. In this paper we present a transformation of strongly connected (HSDFa) models into timed automata models. This enables an exact end-to-end latency analysis because the correlation between the firing durations of different firings is taken into account. In a case study we compare the latency obtained using timed automata and a Linear Program (LP) based analysis technique that relies on a deterministic abstraction and compare their run-times as well. Exact end-to-end latency analysis results are obtained using timed automata, whereas this is not possible using deterministic timed-dataflow models.
Download Paper (PDF; Only available from the DATE venue WiFi)

End of session

6.8 HIPEAC: European Network on High Performance and Embedded Architecture and Compilation
Date: Wednesday 29 March 2017
Time: 11:00 - 12:30
Location / Room: Exhibition Theatre
Organiser: Catherine Roderick, Barcelona Supercomputing Center, ES
Moderator: Luca Fanucci, University of Pisa, IT
This session will showcase the activities of this network of research expertise. HIPEAC members come from both industry and academia and, together, form a community of expertise in Europe which reinforces and strengthens R&D activities. We offer funding for industrial PhD internships and short-term collaborations between early-career researchers and other research centres, as well as annual Tech Transfer Awards and communications and recruitment services. Annual HIPEAC activities include a high-profile conference, a researcher summer school and two Computing Systems Weeks, which are networking and knowledge-exchange gatherings. We also produce a biennial technology roadmap, the HiPEAC Vision, which recommends future actions and priorities for the European computing systems community and is a key source of reference for the European Commission. In this session, after a brief introduction to HiPEAC, we highlight some of our members’ innovative and groundbreaking research and development activities.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>11:00</td>
<td>6.8.1</td>
<td>ACCELERATED DATA CENTERS FOR CLOUD COMPUTING: THE VINEYARD PLATFORM</td>
<td>Dimitrios Soudris, National Technical Univ. of Athens and ICCS, GR</td>
</tr>
</tbody>
</table>

**Abstract**

VINEYARD aims to develop the technology and the ecosystem that will enable the efficient and seamless integration of hardware acceleration into data centres. Energy-efficient hardware accelerators will be deployed to significantly improve the performance of cloud computing applications and reduce the energy consumption of data centres.

VINEYARD is developing an integrated framework for energy-efficient data centres based on programmable hardware accelerators. It is working towards a high-level programming framework that allows end-users to seamlessly utilize these accelerators in heterogeneous computing systems by using typical data-centre cluster frameworks (e.g. Spark). VINEYARD is also developing two types of novel energy-efficient servers integrating two kinds of hardware accelerators: programmable dataflow-based accelerators and FPGA-based accelerators. The servers coupled with dataflow-based accelerators are suitable for cloud computing applications that can be represented as dataflow graphs, while the servers coupled with FPGA-based accelerators will be used for accelerating applications that need tight communication between the processor and the hardware accelerators.

VINEYARD also fosters the establishment of an ecosystem that will empower open innovation based on hardware accelerators as data-centre plugins, thereby facilitating innovative enterprises (large industries, SMEs, and creative start-ups) to develop novel solutions using VINEYARD’s leading edge developments.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>11:15</td>
<td>6.8.2</td>
<td>HIGH-PERFORMANCE PARALLELISATION OF REAL-TIME APPLICATIONS WITH THE UPSCALE SDK</td>
<td>Luis Miguel Pinho, Polytechnic of Porto, PT</td>
</tr>
</tbody>
</table>

**Abstract**

Nowadays, computing systems are so ubiquitous in our lives that it would not be far-fetched to state that we live in a cyber-physical world dominated by computer systems. These systems demand more and more computational performance to process large amounts of data from multiple data sources, some of them with guaranteed processing response times. In other words, systems are required to deliver their results within pre-defined (and sometimes extremely short) time bounds. Examples can be found, for instance, in intelligent transportation systems for fuel consumption reduction in cities or railways, or in autonomous driving of vehicles.

To cope with such performance requirements, chip designers produced chips with dozens or hundreds of cores, interconnected with complex networks on chip. Unfortunately, the parallelization of the computing activities brings many challenges, among which how to provide timing guarantees, as the timing behaviour of the system running within a many-core processor depends on interactions on shared resources that are most of the time not known by the system designer.

P-SOCRATES (Parallel Software Framework for Time-Critical Many-core Systems) is an FP7 European project, which developed a novel methodology to facilitate the deployment of standardized parallel architectures for real-time applications. This methodology was implemented (based on existing models and components) to provide an integrated software development kit, the Upscale SDK, to fully exploit the huge performance opportunities brought by the most advanced many-core processors, whilst ensuring a predictable performance and maintaining (or even reducing) development costs of applications. The presentation will provide an overview of the Upscale SDK, its underlying methodology, and the results of its application on relevant industrial use-cases.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>11:30</td>
<td>6.8.3</td>
<td>POWER-AWARE SOFTWARE MAPPING OF PARALLEL APPLICATIONS ONTO HETEROGENEOUS MPSOCs</td>
<td>Gereon Onnebrink, RWTH Aachen University, DE</td>
</tr>
</tbody>
</table>

**Abstract**

With the ever-increasing need of computational power, heterogeneous multi- and many-processor SoCs provide the best trade-off between performance, cost, and power. However, one of the biggest hurdles to exploit multicore architectures from the SW side is how to efficiently develop performance and power co-optimised parallel applications. Making the right decisions in the vast SW design space can hardly be done by the programmer in a reasonable time frame, especially, when performing a manual design process. Considering an application that has been properly partitioned into multiple concurrent tasks, and programmed in a parallel language, the process of mapping those tasks onto the processors with the optimal voltage and frequency setting is a huge challenge for a certain design goal. An automatic approach is needed that determines the optimal decision, given an optimisation constraint. A great amount of research has been conducted at ICE aiming to optimise the performance of a parallelised application.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>11:45</td>
<td>6.8.4</td>
<td>OVERVIEW OF MANGO: EXPLORING MANYCORE ARCHITECTURES FOR NEXT-GENERATION HPC SYSTEMS</td>
<td>José Flich, Technical University of Valencia, ES</td>
</tr>
</tbody>
</table>

**Abstract**

The performance/power efficiency wall poses the major challenge faced nowadays by HPC. Looking straight at the heart of the problem, the hurdle to the full exploitation of today's computing technologies ultimately lies in the gap between the applications’ demand and the underlying computing architecture: the closer the computing system matches the structure of the application, the more efficiently the available computing power is exploited. Consequently, enabling a deeper customization of architectures to applications is the main pathway towards computation power efficiency. In addition to mere performance and power efficiency, it is of paramount importance to meet new non-functional requirements posed by emerging classes of applications. In particular, a growing number of HPC applications demand some form of time-predictability, or more generally Quality-of-Service (QoS), particularly in those scenarios where correctness depends on both performance and timing requirements and the failure to meet either of them is critical.

The MANGO project builds on these considerations and will set inherent architecture-level support for application-based customization as one of its underlying pillars. In addition, a heterogeneous platform for HPC architecture exploration will be deployed.

GREENOPENHEVC: LOW POWER HEVC DECODER

Presenter: Menard Daniel, INSA Rennes, FR
Authors: Julien Heulet¹, Erwan Nogue², Maxime Pelcat² and Wassim Hamidouche³

Abstract
Video on mobile devices is a must-have feature given the prominence of new services and applications that use video, such as streaming or conferencing. The new video standard HEVC is an appealing technology for service providers. Besides, with the recent progress of SoCs, software video decoders are now a reality. The challenge is to provide a power-efficient design that meets the compelling demand for long battery life. We present here a practical set-up demonstrating that the new HEVC standard can be implemented in software on an embedded GPP multicore platform. Different techniques have been integrated to optimize the energy: data-level and thread-level parallelism, and video-aware Dynamic Voltage and Frequency Scaling. To push back the limits, algorithm-level approximate computing is carried out on the in-loop filtering. Subjective tests have demonstrated that the quality degradation is almost imperceptible. A mean power of less than 1 Watt is reported for HD 1080p/24fps video decoding.

NOXIM-XT: A BIT-ACCURATE POWER ESTIMATION SIMULATOR FOR NOCS

Presenter: Pierre Bomel, Université de Bretagne Sud, FR
Authors: André Rossi¹, Johann Laurent² and Erwan Moreac³

Abstract
We have developed an enhanced version of Noxim (Noxim-XT) to estimate the energy consumption of a NoC in an SoC. Noxim-XT is used in a two-step methodology. First, applications are mapped onto an SoC and their traffic is extracted by simulation with MPSOdBench. Second, Noxim-XT tests various hardware configurations of the NoC; for each configuration, the application’s traffic is re-injected and replayed, an accurate performance and power breakdown is provided, and the user can choose different data coding strategies. With the help of Noxim-XT, the energy consumption of each configuration is estimated bit-accurately. After simulation, a spatial mapping of the energy consumption is provided that highlights the hot-spots. Moreover, the new coding strategies allow significant energy savings. Noxim-XT simulations and an FPGA-based prototype of a new coding strategy will be demonstrated at the U-booth to illustrate this work.

More information ...

EYES OF THINGS

Speaker: Matteo Sorci, nVISO, CH

Abstract
Currently, computer vision is rapidly moving beyond academic research and factory automation. With the appropriate platforms and tools, the emerging possibilities are endless in terms of wearable applications, augmented reality, surveillance, ambient-assisted living, etc.

Vision, our richest sensor, allows mining big data from reality. While the number of image sensors deployed across all products in the world is a small fraction of the total number of sensors deployed, the amount of data they generate dwarfs the amount generated by all other types of sensors combined. This has a cost: vision is arguably the most demanding sensor in terms of power consumption and required processing power.

Our objective in this project is to build a core vision platform optimized for power, size, cost and programmability that can work independently and can also be embedded into all types of artefacts. The envisioned open hardware is being combined with carefully designed APIs that maximize inferred information per milliwatt and adapt the quality of inferred results to each particular application. This will not only mean more hours of continuous operation; it will also allow the creation of novel applications and services that go beyond what current vision systems can do, which are either personal/mobile or "always-on" but not both at the same time.

12:30 End of session
Lunch Break in Garden Foyer

Keynote Lecture session 7.0 in "Garden Foyer" 13:50 - 14:20

Lunch Break in the Garden Foyer
On all conference days (Tuesday to Thursday), a buffet lunch will be offered in the Garden Foyer, in front of the session rooms. Kindly note that this is restricted to conference delegates who hold a lunch voucher. When entering the lunch break area, delegates will be asked to present the lunch voucher for that day. Once a delegate has left the lunch area, re-entry is not allowed for that lunch.
More information ...

SEFILE: A SECURE FILESYSTEM IN USERSPACE VIA SECUBE™

Presenter:
Giuseppe Airofarulla, CINI, IT

Authors:
Paolo Prinetto¹ and Antonio Varriale²
¹CINI & Politecnico di Torino, IT; ²Blu5 Labs Ltd., IT

Abstract
The SEcube™ Open Source platform is a combination of three main cores in a single-chip design. A low-power ARM Cortex-M4 processor, a flexible and fast Field-Programmable Gate Array (FPGA), and an EAL5+ certified Security Controller (SmartCard) are embedded in an extremely compact package. This makes it a unique Open Source security environment where each function can be optimized, executed, and verified on its proper hardware device. In this demo, we present a Windows wrapper for a Filesystem in Userspace (FUSE) with an HDD firewall that relies on the built-in hardware capabilities and the software libraries of the SEcube™.

LABSMILING: A FRAMEWORK, COMPOSED OF A REMOTELY ACCESSIBLE TESTBED AND RELATED SW TOOLS, FOR ANALYSIS AND DESIGN OF LOW DATA-RATE WIRELESS PERSONAL AREA NETWORKS BASED ON IEEE 802.15.4

Presenter:
Marco Santic, University of L’Aquila, IT

Authors:
Luigi Pomante, Walter Tiberti, Carlo Centofanti and Lorenzo Di Giuseppe, DEWS - Università di L’Aquila, IT

Abstract
Low data-rate wireless personal area networks (LR-WPANs) are increasingly present in the fields of IoT, wearable devices and health monitoring. The development, deployment and testing of such systems, based on the IEEE 802.15.4 standard (and its derivations, e.g. 15.4e), require a testbed once the network is no longer trivial and grows in complexity. This demo shows the LabSmiling framework: a testbed and related SW tools that connect a meaningful (but still scalable) number of physical devices (sensor nodes) located in a real environment. It offers the following services: program, reset, and switch on/off single devices; connect to device up/down links to inject or receive commands, messages and packets into or from the network; set devices as low-level packet sniffers, allowing protocol compliance or extensions to be tested and debugged. Advanced services include the possibility of designing test scenarios for the evaluation of network metrics (throughput, latencies, etc.) and custom application verification.

INTERNET OF EVERYTHING IS OUR OPPORTUNITY

Author:
Keith Willett, Director of Software Engineering for Merck Serono, CH

Abstract
Merck Serono is working to revolutionize patient care and doctor assistance through technology that is built on the Internet of Everything. Using global resources to consolidate medical devices under a single platform that will store and analyze data and recommend patient care to physicians, Merck is leveraging the Internet of Everything to improve patient care. The IoT is not limited to medical devices, as everything from automobiles to light bulbs is looking for ways to connect to the Internet. These devices gather, store and analyze data to improve the user experience and create value for people and businesses that has yet to be recognized. However, connecting so many products will put increased strain on network infrastructures and, most importantly, expose personal information to potential threats if not managed correctly. All companies connecting devices have similar problems and are working to solve these issues. As the Internet of Everything continues to evolve, critical strategies will need to be in place for all companies to be successful. This presentation will discuss the strategies companies need in order to play in this space and how collaboration and cooperation will become more common in IoT.

UB07 Session 7
Date: Wednesday 29 March 2017
Time: 14:00 - 16:00
Location / Room: Booth 1, Exhibition Area

<table>
<thead>
<tr>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>UB07.1</td>
<td>COSSIM: A NOVEL, COMPREHENSIBLE, ULTRA-FAST, SECURITY-AWARE CPS SIMULATOR</td>
<td>Nikolaos Tampouratzis, Technical University of Crete, GR</td>
</tr>
<tr>
<td></td>
<td><strong>Presenter:</strong></td>
<td>Antonios Nikitakis and Andreas Brokalakis, Synelixis Solutions Ltd, GR</td>
</tr>
<tr>
<td></td>
<td><strong>Abstract</strong></td>
<td>One of the main problems Cyber Physical Systems (CPS) and Highly Parallel Systems (HPS) designers face is the lack of simulation tools and models for system design and analysis. This is mainly because the majority of the existing simulation tools can handle efficiently only parts of a system (e.g. only the processing or only the network) while none of them supports the notion of security. Moreover, most of the existing simulators need extreme amounts of processing resources while faster approaches cannot provide the necessary precision and accuracy. COSSIM is an open-source framework that seamlessly simulates, in an integrated way, the networking and the processing parts of the CPS and Highly Parallel Heterogeneous Systems. In addition, COSSIM supports accurate power estimations while it is the first such tool supporting security as a feature of the design process. The complete COSSIM framework together with its sophisticated GUI will be presented. More information ...</td>
</tr>
</tbody>
</table>

| Label | Presentation Title | Authors |
| --- | --- | --- |
| UB07.2 | RZMEDI: WHEELCHAIR MOUNTED ROBOTIC ARM DEMONSTRATOR FOR PEOPLE WITH MOTOR SKILLS IMPAIRMENTS | Alessandro Palla, University of Pisa, IT |
|  | **Presenter:** | Gabriele Meoni and Luca Fanucci, University of Pisa, IT |
|  | **Abstract** | People with reduced mobility experience many issues when interacting with indoor and outdoor environments because of their disability. For these users, even the simplest action might be a hard or impossible task to perform without external assistance. We propose a simple and lightweight wheelchair-mounted robotic arm, with a focus on a human-machine interface that has to be simple and accessible for users with different kinds of disabilities. The robotic arm is equipped with a 5 MP camera, force and proximity sensors and a 6-axis Inertial Measurement Unit on the end-effector, and can be controlled using an app running on a tablet. When the user selects the object to reach (for instance a button) on the tablet screen, the arm autonomously carries out the task, using the camera image and the sensor measurements for autonomous navigation. The demonstrator consists of the robotic arm prototype, the Android tablet and a personal computer for arm setup and configuration. More information ... |
| UB07.3 | FLEXPORT: FLEXIBLE PLATFORM FOR OBJECT RECOGNITION & TRACKING TO ENHANCE INDOOR LOCALIZATION AND MAPPING | Marko Rößler, Technische Universität Chemnitz, DE |
|  | **Presenter:** | Christian Schott, Murali Padmanabha and Ulrich Heinkel, TU Chemnitz, DE |
|  | **Abstract** | Object detection plays a crucial role in realizing intelligent indoor localization and mapping techniques. The advantages of these techniques come with added complexity in the computing hardware and in mobility. While the availability of open source computer vision algorithms and High-Level-Synthesis frameworks accelerates development, the hybrid processing architecture of an All Programmable System on Chip (APSoC) enables efficient hardware-software partitioning. Using these tools, a generic platform was designed for evaluating computer vision algorithms. Open source components such as the Linux kernel and the OpenCV libraries were integrated to evaluate the algorithms in software, while the Vivado HLS framework was used to synthesize the hardware counterparts. Algorithms such as Sobel filtering and the Hough Line transformation were implemented and analyzed. The capabilities of this platform were used to realize a mobile object detection system for enhancing localization techniques. More information ... |
| UB07.4 | NETWORKED LABS-ON-CHIPS | Werner Haselmayr, Andreas Springer and Robert Wille, Johannes Kepler University Linz, AT |
|  | **Presenter:** | Andreas Grimm, Johannes Kepler University Linz, AT |
|  | **Abstract** | Labs-on-Chip (LoC) allow for the miniaturization, integration, and automation of medical and bio-chemical procedures. In recent years, different technologies have been considered. However, all of them have their drawbacks: electrowetting-based LoCs suffer from the evaporation of liquids, the fast degradation of the surface coatings, and inferior biocompatibility, while flow-based LoCs require a complex and costly multilayer fabrication process. Hence, an alternative has recently been proposed in the form of Networked Labs-on-Chips. We present and demonstrate the NLoc technology, in which so-called droplets flow inside micrometer-sized channels. Networking functionalities enable the designer to dynamically select the operations to be conducted. These networking functionalities exploit hydrodynamic forces acting on droplets. Moreover, NLoc devices can be produced at low cost (e.g. using 3D printers). In this way, the drawbacks of established LoC technologies are addressed. More information ... |
SCCHARTS: SYNCHRONOUS STATECHARTS FOR SAFETY-CRITICAL APPLICATIONS
Presenter:
Reinhard von Hanxleden, Kiel University, DE
Authors:
Michael Mendler, Christian Motika, Christoph Daniel Schulze and Steven Smyth
¹Bamberg University, DE; ²Kiel University, DE
Abstract
We present a visual language, SCCharts, designed for specifying safety-critical reactive systems. SCCharts use a statechart notation and provide determinate concurrency based on a synchronous model of computation (MoC), without restrictions common to previous synchronous MoCs. Specifically, we lift earlier limitations on sequential accesses to shared variables, by leveraging the sequentially constructive MoC. For further details, see [von Hanxleden et al., PLDI'14] and http://www.scccharts.com. The SCCharts demonstrator is an Eclipse Rich Client and part of KIELER (http://www.rtsys.informatik.uni-kiel.de/en/research/kieler). The demonstration shows how to write an SCChart model using a textual notation, from which a visual model is generated on the fly using the Eclipse Layout Kernel (ELK). We also present a compilation chain that allows efficient synthesis of software and hardware.
More information ...

GNOCS: AN ULTRA-FAST, HIGHLY EXTENSIBLE, CYCLE-ACCURATE GPU-BASED PARALLEL NETWORK-ON-CHIP SIMULATOR
Presenter:
Amir CHARIF, TIMA, FR
Authors:
Nacer-Eddine Zergainoh and Michael Nicolaides, TIMA, FR
Abstract
With the continuous decrease in feature sizes and the recent emergence of 3D stacking, chips comprising thousands of nodes are becoming increasingly relevant, and state-of-the-art NoC simulators are unable to simulate such a high number of nodes in reasonable times. In this demo, we showcase GNOCS, the first detailed, modular and scalable parallel NoC simulator running fully on GPU (Graphics Processing Unit). Based on a unique design specifically tailored for GPU parallelism, GNOCS is able to achieve unprecedented speedups with no loss of accuracy. To enable quick and easy validation of novel ideas, the programming model was designed with high extensibility in mind. Currently, GNOCS accurately models a VC-based microarchitecture. It supports 2D and 3D mesh topologies with full or partial vertical connections. A variety of routing algorithms and synthetic traffic patterns, as well as dependency-driven trace-based simulation (Nettrace), are implemented and will be demonstrated.
More information ...

PER: METHOD AND TOOL FOR ANALYZING THE INTERPLAY BETWEEN PERFORMANCE, ENERGY AND SCALING IN MULTI- AND MANY-CORE PLATFORMS
Presenter:
Fei Xia, Newcastle University, GB
Authors:
Ashur Rafiee, Alexander Romanovsky and Alex Yakovlev, Newcastle University, GB
Abstract
Parallelization has been used to maintain a reasonable balance between energy consumption and performance in computing systems. However, the effectiveness of parallelization scaling is different for different hardware platforms. This is because the reliable operation region (ROR), a region defined in the voltage-throughput space for any hardware platform, is platform-dependent and its shape determines how effective parallelization scaling is in improving throughput and/or reducing power consumption. Although many of the interlinked issues are known, a unifying analysis method has just now been proposed to study the interplay between performance, energy, reliability and parallelization scaling. The method of bi-normalization of the ROR is designed to help achieve a meaningful cross-platform analysis of this interplay. The PER tool brings all these issues together and helps designers reason about hardware parallelization, DVFS and software parallelizability.
More information ...

SELINK: SECURING HTTP AND HTTPS-BASED COMMUNICATION VIA SECUBE™
Presenter:
Giuseppe Airofarulla, CINI & Politecnico di Torino, IT
Authors:
Paolo Prinetto
Abstract
The SEcube™ Open Source platform is a combination of three main cores in a single-chip design. A low-power ARM Cortex-M4 processor, a flexible and fast Field-Programmable Gate Array (FPGA), and an EAL5+ certified Security Controller (SmartCard) are embedded in an extremely compact package. This makes it a unique Open Source security environment where each function can be optimized, executed, and verified on its proper hardware device. In this demo, we present a client-server HTTP and HTTPS-based application whose traffic is encrypted using the built-in hardware capabilities and the software libraries of the SEcube™. By doing so, we show how communication can be secured against an attacker capable of inspecting and tampering with the regular communication.
More information ...

STACKADROP: A MODULAR DIGITAL MICROFLUIDIC BIOCHIP RESEARCH PLATFORM
Presenter:
Oliver Keszöcze, University of Bremen, DE
Authors:
Maximilian Luenert and Rolf Drechsler, University of Bremen & DFKI GmbH, DE
Abstract
Advances in microfluidic technologies have led to the emergence of Digital Microfluidic Biochips (DMFBs), which are capable of automating laboratory procedures. These DMFBs have attracted significant attention in industry and academia, creating a demand for devices. Commercial products are available but come at a high price. So far, there are two open hardware DMFBs available: the DropBot from WheelerLabs and the OpenDrop from GaudiLabs. The aim of the StackADrop was to create a DMFB with many directly addressable cells while still being very compact. The StackADrop strives to provide the means to experiment with different hardware setups. Its main feature is the exchangeable top plates, which support 256 high-voltage pins. It features SPI, UART and I2C connectors for attaching sensors and actuators, and can be connected to a computer via USB for interactive sessions using control software. The modularity makes it easy to test different cell shapes, such as squares, hexagons and triangles.
More information ...

PULP: AN ULTRA-LOW POWER PLATFORM FOR THE INTERNET-OF-THINGS
Presenter:
Francesco Conti, ETH Zurich, CH
Authors:
Stefan Mach, Florian Zarubia, Antonio Pullini, Daniele Palossi, Giovanni Rovere, Florian Glaser, Germain Haugou, Schekeb Fateh and Luca Benini
¹ETH Zurich, CH; ²ETH Zurich, CH and University of Bologna, IT
Abstract
The PULP (Parallel Ultra-Low Power) platform strives to provide high performance for IoT nodes and endpoints within a very small power envelope. The PULP platform is based on a tightly-coupled multi-core cluster and on a modular architecture, which can support complex configurations with autonomous I/O without SW intervention, HW-accelerated execution of hot computation kernels, and fine-grain event-based computation. It can also be deployed in very simple configurations, such as the open source PULPino microcontroller. In this demonstration booth, we will showcase several prototypes using PULP chips in various configurations. Our prototypes perform demos such as real-time deep-learning based visual recognition from a low-power camera, and online biosignal acquisition and reconstruction on the same chip. Application scenarios for our technology include healthcare wearables, autonomous nano-UAVs, and smart networked environmental sensors.
More information ...
7.1 IoT Day Hot Topic Session: IoT Deployment

Date: Wednesday 29 March 2017
Time: 14:30 - 16:00
Location / Room: SBC

Organisers: Marilyn Wolf, Georgia Tech, US
Andreas Herkersdorf, TU Muenchen, DE

Chair: Marilyn Wolf, Georgia Tech, US
Co-Chair: Andreas Herkersdorf, TU Muenchen, DE

IoT technologies have the potential to be a disruptive game changer for existing applications and services as well as an enabler for new businesses. This session provides viewpoints from industry as well as a startup company on the deployment and evolution of IoT-oriented services and products.

Time | Label | Presentation Title | Authors
--- | --- | --- | ---
14:30 | IP7.1.1, 7107 | A LOW-POWER IOT PROCESSOR INTEGRATING VOLTAGE-SCALABLE FULLY DIGITAL MEMORIES | Hidetoshi Onodera, Kyoto University, JP
14:37 | IP7.1.2, 7108 | A SIMPLE, STATELESS, COST EFFECTIVE SYMMETRIC CRYPTOGRAPHY STRATEGY FOR ENERGY-HARVESTING IOT DEVICES | Jan Madsen, Technical University of Denmark, DK
14:44 | IP7.1.3, 7109 | RECONFIGURABLE MICROCONTROLLER FOR END NODES IN INTERNET OF THINGS | Wai-Chung Matthew Tang, Queen Mary University of London, GB
14:51 | IP7.1.4, 7110 | FURTHER SIMPLIFICATION OF APPROXIMATE ADDERS USING INPUT DATA RANGES IN IOT | Jeong-A Lee, Chosun University, KR
15:00 | 7.1.2 | HOW ASIC DEVELOPMENT WILL CHANGE FOR FUTURE IOT MEMS SENSORS | Dirk Droste, Robert Bosch GmbH, DE
Abstract
The global ASIC community faces a strong trend towards new IoT applications, but what is the concrete substance behind all the fuzzy discussions for the ASIC design community? This talk gives an overview of the perspective of Bosch Sensortec: how ASIC development must adapt to upcoming challenges in ASIC design for future IoT MEMS sensors, with their broad span of new applications and features and their challenging requirements for low power, high performance and complex integration.

15:30 | 7.1.3 | DISTRIBUTED WAYSIDE ARCHITECTURE - IOT FOR RAILWAY INFRASTRUCTURE | Peter Hefti, Siemens, CH
Speaker:
Olivier Kaiser, Siemens, CH
Abstract
Railway infrastructure is characterized by very long life cycles, e.g. 25 years or even more, and very harsh environmental conditions. The requirements for availability and safety are nonetheless very demanding in order to assure efficient and safe operation. In addition, the fulfillment of these requirements has to be shown formally in so-called safety cases. These cases have to be confirmed by independent safety assessors and eventually by government agencies. Under these circumstances, the adoption of new technologies in the railway industry can be a challenge.

Over the last decades, the architecture of railway control systems has been more or less stable. The trackside equipment, i.e. points, signals, track vacancy detection etc., is connected via star-shaped cabling to an interlocking. This interlocking distributes the energy and assures safety by controlling the trackside equipment accordingly. The star-shaped cabling limits the control range of every interlocking, so an interlocking is needed in every station. Both this cabling concept and the large number of interlocking installations lead to high costs.

To bring the overall costs down, new concepts have to be implemented. The field elements have to be connected via bus systems, ideally based on the Internet Protocol. This reduces cabling and increases the distance over which the elements can be controlled. Thus, the number of cabinets and installations can be distinctly reduced. Furthermore, off-the-shelf communication equipment can be used to connect the field elements. In the long run, a centralized operation of the control equipment in data centers can be envisioned.

However, installing an internet of things along the track, where all signals, points and level crossings are subscribers, is demanding for the following reasons:
• The functional safety has to be provided in a way that it can be formally proven.
• A very high availability is necessary to assure steady operation. If an element or the connection to an element breaks down, no or only reduced operation is possible.
• Security problems could affect passenger safety; hence, the communication system has to fulfill the highest standards.
• Legacy interfaces (e.g. the four-wire interface for point machines) have to be supported further.
• The field elements have to be provided with power. If a data bus is introduced, an adequate power bus is needed too in order to achieve substantial cost savings.

For several years, Siemens has been working on innovating the IoT in the railway infrastructure. We named the concept Distributed Wayside Architecture. First installations at DB in Germany and SBB in Switzerland showed that the challenges mentioned above can be overcome. Current work focuses on the power bus as well as on the scalability of the concepts to larger installations.

7.2 In-memory Computing and Security for Non-volatile Memory Technologies

Date: Wednesday 29 March 2017
Time: 14:30 - 16:00
Location / Room: 4BC
Chair:
Luca Amaru, Synopsys, US
Co-Chair:
Pierre-Emmanuel Gaillardon, University of Utah, US

Non-volatile memories (NVMs) are playing an increasingly dominant role in the construction of energy-efficient systems thanks to reduced static power consumption. NVMs raise new opportunities and challenges in terms of enhancing computational efficiency and ensuring security, respectively. This session explores in-memory computing applied to emerging NVM technologies and goes on to investigate security and encryption strategies.

7.2.1 AUTOMATED SYNTHESIS OF COMPACT CROSSBARS FOR SNEAK-PATH BASED IN-MEMORY COMPUTING

Speaker:
Sumit Kumar Jha, University of Central Florida, US
Authors:
Dwaipayan Chakraborty and Sumit Kumar Jha, University of Central Florida, US
Abstract
The rise of data-intensive computational loads has exposed the processor-memory bottleneck in Von Neumann architectures and has intensified the need for in-memory computing. Existing literature on computing Boolean formulas using sneak paths in nanoscale memristor crossbars has only focused on short Boolean formulas, such as 1-bit addition. There are two open questions: (i) Can one synthesize sneak-path based crossbars for computing large Boolean formulas? (ii) What is the size of a memristor crossbar that can compute a given Boolean formula using sneak paths? In this paper, we make progress on both these open problems. First, we show that the number of rows and columns required to compute a Boolean formula is at most linear in the size of the Reduced Ordered Binary Decision Diagram representing the Boolean function. Second, we demonstrate how Boolean Decision Diagrams can be used to synthesize nanoscale crossbars that can compute a given Boolean formula using naturally occurring sneak paths. In particular, we synthesize large logical circuits such as 128-bit adders for the first time using sneak-path based crossbar computing.

Download Paper (PDF; Only available from the DATE venue WiFi)
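
One rough way to picture sneak-path computing: each crosspoint device is set to low resistance when its assigned literal is true, and the formula evaluates to true when current can flow from a source nanowire to a sink nanowire through low-resistance devices. The sketch below models this as graph reachability on a hand-built 2x2 crossbar for XOR. The layout, the labels and the function names are illustrative assumptions; this is not the BDD-based synthesis procedure of the paper.

```python
from collections import deque

# Hypothetical 2x2 crossbar for a XOR b: design[(row, col)] = (variable, polarity).
# A device conducts when the variable's value equals the polarity.
XOR_DESIGN = {
    (0, 0): ("a", True),   # row 0 / col 0 conducts when a is true
    (1, 0): ("b", False),  # row 1 / col 0 conducts when b is false
    (0, 1): ("a", False),  # row 0 / col 1 conducts when a is false
    (1, 1): ("b", True),   # row 1 / col 1 conducts when b is true
}


def evaluate(design, inputs, source_row=0, sink_row=1):
    """Return True if a conducting path links source_row to sink_row."""
    # Nanowires are graph nodes; each conducting device is a row <-> column edge.
    adj = {}
    for (r, c), (var, polarity) in design.items():
        if inputs[var] == polarity:
            adj.setdefault(("row", r), []).append(("col", c))
            adj.setdefault(("col", c), []).append(("row", r))
    start, goal = ("row", source_row), ("row", sink_row)
    seen, queue = {start}, deque([start])
    while queue:  # breadth-first search over the nanowires
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


for a in (False, True):
    for b in (False, True):
        print(a, b, evaluate(XOR_DESIGN, {"a": a, "b": b}))
```

In this toy layout, column 0 carries the term a AND NOT b and column 1 carries NOT a AND b, so a path exists exactly when the inputs differ.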

15:00 7.2.2 HYBRID SPIKING-BASED MULTI-LAYERED SELF-LEARNING NEUROMORPHIC SYSTEM BASED ON MEMRISTOR CROSSBAR ARRAYS

Speaker:
Yiran Chen, Professor, US
Authors:
Amr Hassan, Chaofei Yang, Chenchen Liu, Hai (Helen) Li and Yiran Chen, University of Pittsburgh, US
Abstract
Neuromorphic computing systems are under heavy investigation as a potential substitute for traditional von Neumann systems in high-speed, low-power applications. Recently, memristor crossbar arrays have been used to realize spiking-based neuromorphic systems, where memristor conductance values correspond to synaptic weights. Most of these systems are composed of a single crossbar layer, in which system training is done off-chip using computer-based simulations, after which the trained weights are pre-programmed into the memristor crossbar array. However, multi-layered, on-chip trained systems become crucial for handling massive amounts of data and for overcoming the resistance shift that occurs in memristors over time. In this work, we propose a spiking-based multi-layered neuromorphic computing system capable of online training. The system performance is evaluated using three different datasets, showing improved results versus previous work. In addition, studying the system accuracy versus memristor resistance shift shows promising results.

Download Paper (PDF; Only available from the DATE venue WiFi)

15:30 7.2.3 REVAMP: RERAM BASED VLIW ARCHITECTURE FOR IN-MEMORY COMPUTING

Speaker:
Anupam Chattopadhyay, School of Computer Science and Engineering, Nanyang Technological University, SG
Authors:
Debjyoti Bhattacharjee, Rajeswari Devadoss and Anupam Chattopadhyay, Nanyang Technological University, SG
Abstract
With diverse types of emerging devices offering simultaneous capability of storage and logic operations, researchers have proposed novel platforms that promise gains in energy efficiency. Such platforms can be classified into two domains: application-specific and general-purpose. The application-specific in-memory computing platforms include machine learning accelerators, arithmetic units, and Content Addressable Memory (CAM)-based structures. On the other hand, the general-purpose computing platforms stem from the idea that several in-memory computing logic devices support a universal set of Boolean logic operations and can therefore be used for mapping arbitrary Boolean functions efficiently. In this direction, researchers have so far concentrated on challenges in logic synthesis (e.g. depth optimization) and technology mapping (e.g. device count reduction). The important problem of efficient technology mapping of arbitrary logic networks onto a crossbar array structure has been overlooked so far. In this paper, we propose ReVAMP, a general-purpose computing platform based on a Resistive RAM crossbar array, which exploits the parallelism in computing multiple logic operations in the same word. Further, we study the problem of instruction generation and scheduling for such a platform. We benchmark the performance of ReVAMP with respect to the state-of-the-art architecture.

Download Paper (PDF; Only available from the DATE venue WiFi)
### 7.3 Optimizing performance, energy and predictability via hardware/software codesign

**Date:** Wednesday 29 March 2017  
**Time:** 14:30 - 16:00  
**Location / Room:** 2BC  
**Chair:**  
**Co-Chair:** Stefano Di Carlo, Politecnico di Torino, IT

This session presents a variety of architectural solutions to improve performance, energy and predictability, covering several hardware blocks: processor pipeline, caches, memory and on-chip I/O. The first paper proposes a hardware/software mechanism to classify accesses as private or shared. The second paper introduces a low-power asynchronous microprocessor design. The third paper proposes a coordinated approach to improve performance by partitioning multilevel caches. The last paper proposes a hardware approach to increase the timing accuracy of I/O operations.
14:30 7.3.1 ACCURATE PRIVATE/SHARED CLASSIFICATION OF MEMORY ACCESSES: A RUN-TIME ANALYSIS SYSTEM FOR THE LEON3 MULTI-CORE PROCESSOR

Speaker: Nam Ho, Department of Computer Science, University of Paderborn, DE
Authors: Nam Ho, Ishaq Ibne Ashraf, Paul Kaufmann and Marco Platzner, Department of Computer Science, University of Paderborn, DE

Abstract
Related work has presented simulation-based experiments to classify data accesses in a shared-memory multi-core into private and shared. This information can be used to selectively turn cache coherency mechanisms on or off for data blocks, which can save memory bus bandwidth, minimize energy consumption, and reduce application runtimes. In this paper we present an implementation of a private/shared classification mechanism on a LEON3 SPARC multi-core processor running the Linux 2.6 kernel. Our mechanism is page-based and allows for classifying and counting data accesses at run-time. Compared to previous work, our system provides more accurate, i.e., realistic, data as it includes a real multi-core architecture and an OS. Additionally, our prototype allows us to quantitatively evaluate the overhead of the classification mechanism. We test our system with sequential and parallel benchmarks from the MiBench, ParaMiBench, PARSEC, and SPLASH2 application suites. The results show that parallel benchmarks are promising targets for selectively controlling coherency mechanisms and that the run-time overheads induced by our mechanism are rather small.
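
To make the idea of page-based classification concrete, the following is a minimal software model, assuming a first-touch policy: a page is private to the first core that accesses it and becomes permanently shared once a different core touches it. The page size, the class name and the policy details are assumptions for illustration; the abstract does not describe how the LEON3 implementation tracks pages.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageClassifier:
    """Tracks, per page, the first accessing core and whether the page is shared."""

    def __init__(self):
        self.owner = {}      # page number -> first core that accessed it
        self.shared = set()  # pages accessed by more than one core
        self.counts = {"private": 0, "shared": 0}

    def access(self, core: int, address: int) -> str:
        """Classify one data access and update the per-class counters."""
        page = address // PAGE_SIZE
        first = self.owner.setdefault(page, core)
        if first != core:
            self.shared.add(page)  # a second core arrived: the page stays shared
        kind = "shared" if page in self.shared else "private"
        self.counts[kind] += 1
        return kind


clf = PageClassifier()
# Hypothetical (core, address) trace.
trace = [(0, 0x1000), (0, 0x1004), (1, 0x1008), (0, 0x100C), (1, 0x5000)]
kinds = [clf.access(core, addr) for core, addr in trace]
print(kinds, clf.counts)
```

Accesses classified as private are the ones for which coherency actions could, in principle, be skipped.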

Download Paper (PDF; Only available from the DATE venue WiFi)

15:00 7.3.2 DESIGN OF A LOW POWER, RELATIVE TIMING BASED ASYNCHRONOUS MSP430 MICROPROCESSOR

Speaker: Dipanjan Bhadra, University of Utah, US
Authors: Dipanjan Bhadra and Kenneth Stevens, University of Utah, US

Abstract
Power dissipation is one of the primary design constraints in modern digital circuits. From a multitude of hand-held portable devices to big data analytics using high-performance computing, low energy dissipation is a key requirement for most modern devices. This paper showcases an elegant low power circuit design methodology based on Relative Timing driven asynchronous techniques. A low power MSP430 microprocessor design based on a novel asynchronous finite state machine implementation is presented. The design showcases the power benefits of the proposed asynchronous implementation over the synchronous counterpart and avoids major architectural modifications which would directly influence the performance or power consumption. The implemented asynchronous MSP430 exhibits a minimum of 8X power benefit over the synchronous design for an almost identical pipeline structure and comparable throughput. The paper further elaborates on the novel asynchronous state machine design used for the application and presents an efficient method to design communicating asynchronous finite state machines in clock-less systems.

Download Paper (PDF; Only available from the DATE venue WiFi)

15:30 7.3.3 A COORDINATED MULTI-AGENT REINFORCEMENT LEARNING APPROACH TO MULTI-LEVEL CACHE CO-PARTITIONING

Speaker: Preeti Ranjan Panda, Indian Institute of Technology Delhi, IN
Authors: Rahul Jain1, Preeti Ranjan Panda2 and Sreenivas Subramoney3

Abstract
The widening gap between processor and memory performance has led to the inclusion of multiple levels of caches in modern multi-core systems. Processors with simultaneous multithreading (SMT) support multiple hardware threads on the same physical core, which results in shared private caches. Any inefficiency in the cache hierarchy can negatively impact the system performance and motivates the need to perform a co-optimization of multiple cache levels by trading off individual application throughput for better system throughput and energy-delay-product (EDP). We propose a novel coordinated multi-agent reinforcement learning technique for performing Dynamic Cache Co-partitioning, called DCC. DCC has low implementation overhead and does not require any special hardware data profilers. We have validated our proposal with 15 8-core workloads created using Spec2006 benchmarks and found it to be an effective co-partitioning technique. DCC exhibited system throughput and EDP improvement of up to 14% (gmean: 9.35%) and 19.2% (gmean: 13.5%) respectively. We believe this is the first attempt at addressing the problem of multi-level cache co-partitioning.

Download Paper (PDF; Only available from the DATE venue WiFi)

15:45 7.3.4 GPIOCP: TIMING-ACCURATE GENERAL PURPOSE I/O CONTROLLER FOR MANY-CORE REAL-TIME SYSTEMS

Speaker: Zhe Jiang, University of York, CN
Authors: Zhe Jiang and Neil Audsley, University of York, GB

Abstract
Modern SoC / NoC chips often provide General-Purpose I/O (GPIO) pins for connecting devices that are not directly integrated within the chip. Timing-accurate control of devices connected to GPIO is often required within embedded real-time systems, i.e. I/O operations should occur at exact times, with minimal error, being neither significantly early nor late. This is difficult to achieve due to the latencies and contentions present in the architecture between the CPU instigating the I/O operation and the device connected to the GPIO: software drivers, RTOS, buses and bus contentions all introduce significant variable latencies before the command reaches the device. This is compounded in NoC devices utilising a mesh interconnect between CPUs and I/O devices. The contribution of this paper is a resource-efficient programmable I/O controller, termed the GPIO Command Processor (GPIOCP), that permits applications to instigate complex sequences of I/O operations at an exact time, so achieving timing accuracy at a single clock cycle level. Also, I/O operations can be programmed to occur at some point in the future, periodically, or reactively. The GPIOCP is a parallel I/O controller, supporting cycle-level timing accuracy across several devices connected to GPIO simultaneously. The GPIOCP exploits the tradeoff between using a full sequential CPU to control each GPIO-connected device, which achieves some timing accuracy at high resource cost, and the poor timing accuracy achieved where the application CPU controls the device remotely. The GPIOCP has efficient hardware cost compared to CPU approaches, with the additional benefits of total timing accuracy (CPU solutions do not provide this in general) and parallel control of many I/O devices.

Download Paper (PDF; Only available from the DATE venue WiFi)

16:00 7.3.5 A HARDWARE IMPLEMENTATION OF THE MCAS SYNCHRONIZATION PRIMITIVE

Speaker: Smruti Sarangi, IIT Delhi, IN
Authors: Srishty Patel, Rajshekar Kalayappan, Ishani Mahajan and Smruti R. Sarangi, IIT Delhi, IN

Abstract
Lock-based parallel programs are easy to write. However, they are inherently slow as the synchronization is blocking in nature. Non-blocking lock-free programs, which use atomic instructions such as compare-and-set (CAS), are significantly faster. However, lock-free programs are notoriously difficult to design and debug. This can be greatly eased if the primitives work on multiple memory locations instead of one. We propose MCAS, a hardware implementation of a multi-word compare-and-set primitive. Ease of programming aside, MCAS-based programs are 13.8X and 4X faster on average than lock-based and traditional lock-free programs respectively. The area overhead, in a 32-core 400 mm² chip, is a mere 0.046%.

Download Paper (PDF; Only available from the DATE venue WiFi)
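
To illustrate the semantics of a multi-word compare-and-set as described in the abstract of 7.3.5, the following is a minimal software sketch. It is not the authors' hardware design: the `Cell` class, the `mcas` signature and the `transfer` helper are hypothetical names chosen for this example, and a global lock merely stands in for the atomicity that the proposed hardware would provide.

```python
import threading

class Cell:
    """A single shared memory word (hypothetical helper for this sketch)."""
    def __init__(self, value):
        self.value = value

# Assumption: one global lock emulates the atomicity of the hardware primitive.
_mcas_lock = threading.Lock()

def mcas(cells, expected, new):
    """If every cell holds its expected value, write all new values atomically.

    Returns True on success; returns False and changes nothing otherwise.
    """
    with _mcas_lock:
        if any(c.value != e for c, e in zip(cells, expected)):
            return False
        for c, n in zip(cells, new):
            c.value = n
        return True

def transfer(src, dst, amount):
    """Retry loop in the style of lock-free code: update two words in one step."""
    while True:
        s, d = src.value, dst.value  # snapshot both words (may be stale)
        if mcas([src, dst], [s, d], [s - amount, d + amount]):
            return  # both words were unchanged, so the update was applied

a, b = Cell(1000), Cell(0)
threads = [threading.Thread(target=transfer, args=(a, b, 1)) for _ in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(a.value, b.value)  # the sum of the two cells is preserved
```

With a single-word CAS, the same transfer would need an auxiliary descriptor or helping scheme to keep the two words consistent, which is the programming burden the abstract refers to.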
16:01 | IP3-11, 325 | BANDITS: DYNAMIC TIMING SPECULATION USING MULTI-ARMED BANDIT BASED OPTIMIZATION
Speaker: Jeff Zhang, New York University, US
Authors: Jeff Zhang and Siddharth Garg, New York University, US
Abstract: Timing speculation has recently been proposed as a method for increasing performance beyond that achievable by conventional worst-case design techniques. Starting with the observation of fast temporal variations in timing error probabilities, we propose a run-time technique to dynamically determine the optimal degree of timing speculation (i.e., how aggressively the processor is over-clocked) based on a novel formulation of the dynamic timing speculation problem as a multi-armed bandit problem. By conducting detailed post-synthesis timing simulations on a 5-stage MIPS processor running a variety of workloads, we show that the proposed adaptive mechanism improves processor performance significantly compared with a competing approach (about 8.3% improvement); on the other hand, it shows only about 2.8% performance loss on average compared with the oracle results.
Download Paper (PDF; Only available from the DATE venue WiFi)
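
The bandit formulation in the abstract above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's algorithm: it uses the standard UCB1 rule, treats each over-clocking level as one arm, and replaces the processor with a toy model in which a timing error costs a fixed replay penalty. The function names, the speedup values and the error probabilities are all hypothetical.

```python
import math
import random

def ucb1_select(counts, sums, t):
    """Pick the arm with the highest UCB1 index; try every arm once first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )

def run_bandit(speedups, error_probs, replay_penalty, rounds, seed=0):
    """Return how often each over-clocking level (arm) was chosen."""
    rng = random.Random(seed)
    k = len(speedups)
    counts, sums = [0] * k, [0.0] * k
    max_speedup = max(speedups)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, sums, t)
        # Toy epoch model (assumption): an error stretches the epoch by the penalty.
        error = rng.random() < error_probs[arm]
        throughput = speedups[arm] / (1.0 + (replay_penalty if error else 0.0))
        reward = throughput / max_speedup  # normalised to [0, 1] as UCB1 expects
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Four hypothetical clock settings: faster settings fail more often.
counts = run_bandit([1.0, 1.2, 1.4, 1.6], [0.0, 0.02, 0.9, 1.0], 1.0, 20000, seed=1)
print(counts)
```

In this toy setting the second arm has the best expected throughput, so the selection counts should concentrate on it as the number of rounds grows.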

16:02 | IP3-12, 261 | DESIGN AND IMPLEMENTATION OF A FAIR CREDIT-BASED BANDWIDTH SHARING SCHEME FOR BUSES
Speaker: Carles Hernandez, Barcelona Supercomputing Center (BSC), ES
Authors: Mladen Slijepcevic¹, Carles Hernandez², Jaume Abella³ and Francisco Cazorla⁴
¹Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, ES; ²Barcelona Supercomputing Center, ES; ³Barcelona Supercomputing Center (BSC-CNS), ES; ⁴Barcelona Supercomputing Center and IIIA-CSIC, ES
Abstract: Fair arbitration in the access to shared hardware resources is fundamental to obtain low worst-case execution time (WCET) estimates in the context of critical real-time systems, for which performance guarantees are essential. Several hardware mechanisms exist for managing arbitration in those resources (buses, memory controllers, etc.). They typically attain fairness in terms of the number of slots granted to each contender (e.g., core) for accessing the shared resource. However, those policies may lead to unfair bandwidth allocations when some contenders issue short requests and others issue long requests. We propose a Credit-Based Arbitration (CBA) mechanism that achieves fairness in the cycles each core is granted access to the resource rather than in the number of granted slots. Furthermore, we implement CBA as part of a LEON3 4-core processor for the Space domain on an FPGA, proving the feasibility and good performance characteristics of the design by comparing it against other arbitration schemes.
Download Paper (PDF; Only available from the DATE venue WiFi)
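
The difference between slot fairness and cycle fairness described in the abstract above can be seen in a small simulation. This is a simplified sketch, not the CBA hardware from the paper: the credit rule used here (every core earns an equal share of each elapsed cycle, and the winner pays for the cycles it occupies) is an assumption made for illustration, as are the function name and the request lengths.

```python
def simulate(request_lengths, total_cycles, policy):
    """Return bus cycles granted per core; every core always has a request pending."""
    n = len(request_lengths)
    used = [0] * n
    credits = [0.0] * n
    next_rr = 0
    cycle = 0
    while cycle < total_cycles:
        if policy == "round_robin":
            winner = next_rr                  # slot-fair: one grant per core in turn
            next_rr = (next_rr + 1) % n
        else:                                 # "credit": core with most credit wins
            winner = max(range(n), key=lambda c: credits[c])
        length = request_lengths[winner]
        for c in range(n):
            credits[c] += length / n          # all cores earn credit as time passes
        credits[winner] -= length             # the winner pays for the cycles used
        used[winner] += length
        cycle += length
    return used

# Core 0 issues 1-cycle requests, core 1 issues 8-cycle requests.
print(simulate([1, 8], 14400, "round_robin"))  # [1600, 12800]
print(simulate([1, 8], 14400, "credit"))       # [7200, 7200]
```

Under round-robin both cores receive the same number of grants, yet the core with long requests occupies eight ninths of the bus; under the credit rule the occupied cycles even out.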

16:00 | End of session
Coffee Break in Exhibition Area
On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Tuesday, March 28, 2017
- Coffee Break 10:30 - 11:30
- Coffee Break 16:00 - 17:00

Wednesday, March 29, 2017
- Coffee Break 10:00 - 11:00
- Coffee Break 16:00 - 17:00

Thursday, March 30, 2017
- Coffee Break 10:00 - 11:00
- Coffee Break 15:30 - 16:00

7.4 Advances in Logic Synthesis

Date: Wednesday 29 March 2017
Time: 14:30 - 16:00
Location / Room: 3A

Chair: Paolo Ienne, EPFL, CH
Co-Chair: Tsutomu Sasao, Meiji University, JP

This session focuses on new results in logic synthesis. The first two papers present specialized synthesis algorithms for index generating functions and encoder circuits. The last two papers discuss efficient encoding with SAT of short-circuit detection and combinational delay optimization.

14:30 | 7.4.1 | AN ALGORITHM TO FIND OPTIMUM SUPPORT-REDUCING DECOMPOSITIONS FOR INDEX GENERATION FUNCTIONS.
Speaker: Tsutomu Sasao, Meiji University, JP
Authors: Tsutomu Sasao, Kyu Matsuura and Yukihiro Iguchi, Meiji University, JP
Abstract: Index generation functions are useful for pattern matching, routers in the internet, etc. This paper presents an algorithm to find support-reducing decompositions for index generation functions. Let n be the number of input variables, and let s be the number of bound variables. Then, an exhaustive search for an optimum support-reducing decomposition requires checking \( \binom{n}{s} \) combinations. We found a special property of index generation functions that drastically reduces this search space. With this property, we developed a fast algorithm to find an exact optimum solution. For a given number of bound variables, it finds a decomposition with the fewest rails. Experimental results up to \( n=60 \) and \( s=33 \) are shown.
Download Paper (PDF; Only available from the DATE venue WiFi)
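
To give a feel for the size of the exhaustive search space mentioned in the abstract of 7.4.1, the short sketch below evaluates the binomial coefficient for a few input sizes. The helper name `exhaustive_candidates` is hypothetical, and the snippet only counts candidate bound sets; it does not implement the paper's algorithm.

```python
from math import comb

def exhaustive_candidates(n, s):
    """Number of ways to choose s bound variables out of n input variables."""
    return comb(n, s)

# Small cases first, then the largest case reported in the abstract.
for n, s in [(10, 5), (20, 10), (60, 33)]:
    print(f"n={n:2d} s={s:2d} candidates={exhaustive_candidates(n, s)}")
```

The count grows so quickly with n that enumerating every bound set becomes impractical for the largest case, which is why a property that prunes the search matters.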
<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
<th>Abstract</th>
</tr>
</thead>
<tbody>
<tr>
<td>15:00</td>
<td>7.4.2</td>
<td><strong>TAKING ONE-TO-ONE MAPPINGS FOR GRANTED: ADVANCED LOGIC DESIGN OF ENCODER CIRCUITS</strong></td>
<td>Speaker: Robert Wille, Johannes Kepler University, Linz, AT; Authors: Alwin Zulehner¹ and Robert Wille²</td>
<td>Encoders play an important role in many areas such as memory addressing, data demultiplexing, or for interconnect solutions. However, design solutions for the automatic synthesis of corresponding circuits suffer from various drawbacks, e.g., they are often not scalable, do not exploit the full degree of freedom, or are applicable to realize certain codes only. All these problems are caused by the fact that existing design solutions have to explicitly guarantee a one-to-one mapping. In this work, we propose an alternative design approach which relies on dedicated description means for both, the specification of an encoder as well as its circuit. Based on that, synthesis can be conducted without the need to explicitly take care of guaranteeing one-to-one mappings. Experiments show that this indeed overcomes the drawbacks of current design solutions and leads to an improvement in the resulting number of gates by up to 92%. Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
<tr>
<td>15:30</td>
<td>7.4.3</td>
<td><strong>ANALYSIS OF SHORT-CIRCUIT CONDITIONS IN LOGIC CIRCUITS</strong></td>
<td>Speaker: João Alonso, INESC-ID, PT; Authors: João Pedro¹ and Jose Monteiro²</td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
<tr>
<td>15:45</td>
<td>7.4.4</td>
<td><strong>BUSY MAN'S SYNTHESIS: COMBINATIONAL DELAY OPTIMIZATION WITH SAT</strong></td>
<td>Speaker: Mathias Soeken, EPFL, CH; Authors: Giovanni De Micheli¹ and Alan Mishchenko²</td>
<td>After integration into a depth-optimizing mapping algorithm, the proposed SAT formulation can be used to perform logic rewriting to reduce the logic depth of a network. It is shown that to be effective the logic rewriting algorithm requires (i) a fast SAT formulation and (ii) heuristics to quickly determine whether the given delay constraints are feasible for a given function. The proposed algorithm is more versatile than previous algorithms, which is confirmed by the experimental results. Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
<tr>
<td>16:00</td>
<td>IP3-</td>
<td><strong>TECHNOLOGY MAPPING WITH ALL SPIN LOGIC</strong></td>
<td>Speaker: Azadeh Davoodi, University of Wisconsin - Madison, US; Authors: Boyu Zhang¹ and Azadeh Davoodi²</td>
<td>This work is the first to propose a technology mapping algorithm for All Spin Logic (ASL) devices. The ASL device is the most actively pursued one among spintronics devices, which themselves fall under emerging post-CMOS nano-technologies. We identify the shortcomings of directly applying classical technology mapping to ASL devices, and propose techniques to extend the classical procedure to handle these shortcomings. Our results show that our ASL-aware technology mapping algorithm can achieve on average 9.15% and up to 27.27% improvement in delay (when optimizing delay) with slight improvement in area, compared to the solution generated by classical technology mapping. In a broader sense, our results show the need for developing circuit-level CAD tools that are aware of and optimized for emerging technologies in order to better assess their promise as we move to the post-CMOS era. Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
<tr>
<td>16:01</td>
<td>IP3-</td>
<td><strong>A NEW METHOD TO IDENTIFY THRESHOLD LOGIC FUNCTIONS</strong></td>
<td>Speaker: Spyros Tragoudas, Southern Illinois University Carbondale, US; Authors: Seyed Nima Mozaffari,</td>
<td>An Integer Linear Programming based method to identify current mode threshold logic functions is presented. The approach minimizes the transistor count and benefits from a generalized definition of threshold logic functions. Process variations are taken into consideration. Experimental results show that many more functions can be implemented with predetermined hardware overhead, and the hardware requirement of a large percentage of existing threshold functions is reduced. Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
</tbody>
</table>
### 7.5 Hot Topic Session: The Engineering Challenges for Quantum Computing

**Date:** Wednesday 29 March 2017  
**Time:** 14:30 - 16:00  
**Location / Room:** 3C

**Organisers:**  
Koen Bertels, QuTech & Computer Engineering Lab, NL  
Carmen G. Almudéver, QuTech & Computer Engineering Lab, NL

**Chair:**  
Edoardo Charbon, Delft University of Technology, NL  
**Co-Chair:**  
Said Hamdioui, Delft University of Technology, NL

Quantum computers may revolutionize the field of computation by solving some complex problems that are intractable even for the most powerful current supercomputers. This session will explain the basic concepts of quantum computing and describe what the required layers are for building a quantum system. The different speakers in the session will then address the engineering challenges when building a quantum computer ranging from the core qubit technology, the control electronics, to the microarchitecture for the execution of quantum circuits and efficient quantum error correction and what compiler and system tools are needed in that context.

| Time | Label | Presentation Title | Authors |
|---|---|---|---|
| 14:30 | 7.5.1 | **WHAT IS QUANTUM COMPUTING ALL ABOUT?** | Speaker: Carmen G. Almudever, Delft University of Technology, NL; Authors: Carmen G. Almudever and Koen Bertels, Delft University of Technology, NL |
| 15:00 | 7.5.2 | **QUANTUM PROCESSOR** | Andreas Wallraff, ETH Zurich, CH |
| 15:30 | 7.5.3 | **CONTROL ELECTRONICS FOR QUANTUM COMPUTER** | Hendrik Bluhm, RWTH Aachen, DE |

### 7.6 Memory Reliability: Modeling and Mitigation

**Date:** Wednesday 29 March 2017  
**Time:** 14:30 - 16:00  
**Location / Room:** 5A

**Chair:**  
Jose Pineda De Gyvez, NXP, NL  
**Co-Chair:**  
Vikas Chandra, ARM, US

This session discusses new trends and solutions to model and mitigate resiliency challenges for advanced memory technologies. The first paper discusses unequal protection for more efficient memory resiliency. The second paper analyzes the aging impact on different memory components. Finally, the third paper proposes mitigation schemes for memory peripheral circuitry.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>14:30</td>
<td>7.6.1</td>
<td>(Best Paper Award Candidate) MVP ECC : MANUFACTURING PROCESS VARIATION AWARE UNEQUAL PROTECTION ECC FOR MEMORY RELIABILITY</td>
<td>Joon-Sung Yang, Sungkyunkwan University, KR</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speaker: Joon-Sung Yang, Sungkyunkwan University, KR</td>
<td>Authors: Seungyeb Lee and Joon-Sung Yang, Sungkyunkwan University, KR</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract: With advances in process technology, memory density has increased. However, a smaller feature size makes the memory susceptible to soft errors. For reliability enhancement, ECC with single bit error correction and double bit error detection is widely used. As multiple-bit cell upsets have become dominant, there is a need for stronger ECC. ECC such as RS or BCH codes requires significantly larger overhead and longer latency. To overcome this problem, this paper introduces an unequal protection ECC that assigns a stronger level of protection to weak memory cells and a normal level to normal cells. Information from manufacturing characterization tests is utilized to identify weak memory cells with low design margins. Instead of treating all memory cells equally, the proposed ECC focuses more on the weak cells since they are more susceptible to soft errors. Compared to conventional ECCs, experimental results show that the proposed ECC considerably enhances memory reliability with the same code length.</td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
<tr>
<td>15:00</td>
<td>7.6.2</td>
<td>ANALYZING THE EFFECTS OF PERIPHERAL CIRCUIT AGING OF EMBEDDED SRAM ARCHITECTURES</td>
<td>Josef Kinseher, Intel Deutschland, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speaker: Josef Kinseher, Intel Deutschland, DE</td>
<td>Authors: Josef Kinseher 1, Leonhard Heiß 1 and Ilia Polian 2</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract: Modern Systems-on-Chip rely heavily on the performance of their embedded memories, which are also most susceptible to the increasing reliability challenges of today’s nanoscale technology nodes. However, in contrast to memory core-cells, the effects of transistor aging inside the peripheral logic of SRAM architectures have received little attention. This study works out how BTI- and HCI-induced wear-out of the peripheral SRAM circuitry impacts various performance metrics of an industrially used memory library. We show that the degradation of the peripheral logic is the dominant driver for access speed loss, while it tends to slightly lower memory read margin and lead to minor improvements of write margin. We furthermore show that, in terms of access margin, the degradation of SRAM control circuitry counteracts aging effects inside core-cells and sense amplifiers. Surprisingly, wear-out of peripheral circuitry can even improve access margin when the relative magnitude of PBTI is much lower than that of NBTI. Based on the example of an embedded memory library, this study further underlines the importance of analyzing aging mechanisms at system level rather than for individual interacting sub-circuits.</td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
<tr>
<td>15:30</td>
<td>7.6.3</td>
<td>MITIGATION OF SENSE AMPLIFIER DEGRADATION USING INPUT SWITCHING</td>
<td>Daniel Kraak, Delft University of Technology, NL</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speaker: Daniel Kraak, Delft University of Technology, NL</td>
<td>Authors: Daniel Kraak 1, Innocent Agbo 1, Mottaqiallah Taouil 1, Said Hamdioui 1, Pieter Weckx 2, Stefan Cosemans 2, Francky Catthoor 2 and Wim Dehaene 3</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract: To compensate for time-zero (due to process variation) and time-dependent (due to e.g. Bias Temperature Instability) variability, designers usually add design margins. Due to technology scaling, these variabilities become worse, leading to the need for bigger design margins. Typically, only worst-case scenarios are considered, which do not represent the actual workload of the targeted application. Alternatively, mitigation schemes can be used to counteract the variability. This paper presents a run-time design-for-reliability scheme for memory Sense Amplifiers (SAs); SAs are an integral part of any memory system and are very critical for high performance. The proposed scheme mitigates the impact of time-dependent variability due to aging by using an on-line control circuit to create a balanced workload. The simulation results show that the proposed scheme can reduce the most critical figures-of-merit, namely the offset voltage shift and the sensing delay of the SA, by up to ~40% and ~10%, respectively, depending on the stress conditions (temperature, voltage, workload).</td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
</tr>
<tr>
<td>16:00</td>
<td>IP3-15, 16</td>
<td>A BRIDGING FAULT MODEL FOR LINE COVERAGE IN THE PRESENCE OF UNDETECTED TRANSITION FAULTS</td>
<td>Irith Pomeranz, Purdue University, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speaker and Author: Irith Pomeranz, Purdue University, US</td>
<td>Abstract: A variety of fault models have been defined to capture the behaviors of commonly occurring defects and ensure a high quality of testing. When several fault models are used for test generation, it is advantageous if the existence of an undetectable fault in one model does not imply that a fault in the same component but from a different model is also undetectable. This allows a test set to cover the circuit more thoroughly when additional fault models are used. This paper studies the possibility of defining such fault models by considering transition faults as the first fault model, and bridging faults as the second fault model. The bridging faults are defined to cover lines for which transition faults are not detected. A test compaction procedure is developed to demonstrate the bridging fault coverage that can be achieved, and the effect on the number of tests.</td>
</tr>
<tr>
<td>16:00</td>
<td></td>
<td>End of session</td>
<td>Coffee Break in Exhibition Area</td>
</tr>
</tbody>
</table>
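
The idea of creating a balanced workload for a sense amplifier, mentioned in the abstract of 7.6.3, can be illustrated with a toy model. This is a sketch under loud assumptions and not the authors' control circuit: it assumes that reading a 0 stresses one side of the amplifier and reading a 1 the other, and that a control bit swaps the two inputs every `switch_period` reads. The function name and parameters are hypothetical.

```python
def internal_zero_fraction(bits, switch_period=None):
    """Fraction of reads in which the amplifier internally resolves a 0.

    With switch_period set, the inputs are swapped every switch_period reads,
    so the value seen inside the amplifier is the stored bit XOR the swap state.
    """
    zeros = 0
    for i, bit in enumerate(bits):
        swap = 0 if switch_period is None else (i // switch_period) % 2
        internal = bit ^ swap          # swapped inputs invert the internal value
        zeros += internal == 0
    return zeros / len(bits)

# Worst-case workload (assumption): the same value is read every time.
reads = [0] * 1000
print(internal_zero_fraction(reads))                     # 1.0
print(internal_zero_fraction(reads, switch_period=100))  # 0.5
```

A fraction close to 0.5 means both sides of the amplifier see similar stress in this model, which is the kind of balance the abstract attributes to the on-line control circuit.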

**7.7 Resource management and analysis for embedded architectures**

**Date:** Wednesday 29 March 2017
Embedded architectures often have to provide application performance guarantees despite stringent resource constraints. The talks in this session provide solutions for managing these constraints.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>14:30</td>
<td>7.7.1</td>
<td>(Best Paper Award Candidate) SCALABLE PROBABILISTIC POWER BUDGETING FOR MANY-CORES</td>
<td>Anuj Pathania, Karlsruhe Institute of Technology, IN</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Authors:</td>
<td>Anuj Pathania¹, Heba Khdr², Muhammad Shafique³, Tulika Mitra⁴ and Joerg Henkel¹</td>
</tr>
<tr>
<td></td>
<td></td>
<td>¹Karlsruhe Institute of Technology, DE; ²Karlsruhe Institute of Technology (KIT), DE; ³Vienna University of Technology (TU Wien), AT; ⁴National University of Singapore, SG</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract:</td>
<td>Many-core processors feature hundreds to thousands of cores, which can execute many multi-threaded tasks in parallel. The restrictive power dissipation capacity of a many-core prevents all its executing tasks from operating at their peak performance together. Furthermore, the ability of a task to exploit the part of the power budget allocated to it depends upon its current execution phase. This mandates careful rationing of the power budget amongst the tasks for full exploitation of the many-core. Past research proposed power budgeting techniques that redistribute the power budget amongst tasks based on up-to-date information about their current phases. This phase information needs to be constantly propagated throughout the system and processed, inhibiting scalability.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
<td></td>
</tr>
<tr>
<td>15:00</td>
<td>7.7.2</td>
<td>EXPLOITING SPORADIC SERVERS TO PROVIDE BUDGET SCHEDULING FOR ARINC653 BASED REAL-TIME VIRTUALIZATION ENVIRONMENTS</td>
<td>Matthias Beckert, Institute of Computer and Network Engineering, TU Braunschweig, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Authors:</td>
<td>Matthias Beckert¹, Kai Björn Gemlau² and Rolf Ernst²</td>
</tr>
<tr>
<td></td>
<td></td>
<td>¹Institut für Datentechnik und Kommunikationsnetze - TU Braunschweig, DE; ²TU Braunschweig, DE</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract:</td>
<td>Virtualization techniques for embedded real-time systems typically employ TDMA scheduling to achieve temporal isolation among different virtualized partitions. Due to the fixed TDMA schedule, worst-case response times for IRQs and tasks are significantly increased. Recent publications introduced slack-based IRQ shaping to mitigate this problem. While providing better response times for IRQs, those mechanisms neither improve task timings nor provide work-conserving scheduling. In order to provide such capabilities while still providing temporal isolation, we introduce a method based on the well-known sporadic server model. In combination with a proposed budget scheduler, the system is able to schedule a TDMA-based configuration while providing better response times and the same amount of temporal isolation. We show correctness of the approach and evaluate it in a hypervisor implementation.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
<td></td>
</tr>
<tr>
<td>15:30</td>
<td>7.7.3</td>
<td>PROGRAMMING AND ANALYSING SCENARIO-AWARE DATAFLOW ON A MULTI-PROCESSOR PLATFORM</td>
<td>Reinier van Kampenhout, Eindhoven University of Technology, NL</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Authors:</td>
<td>Reinier van Kampenhout, Sander Stuijk and Kees Goossens, Eindhoven University of Technology, NL</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract:</td>
<td>The FSM-SADF model of computation is especially suitable for analysing real-time applications with input-dependent behaviour such as different modes, variable execution times and scalable parallelism. Although FSM-SADF specifies which scenario transitions are possible, it does not specify how and when they are decided at runtime. Multiple actors of a scenario, e.g. video stream header parsing, may have to fire before it is known which scenario the application is in. We solve this causality dilemma with a concept for executing a sequence of scenarios, and demonstrate an implementation on multiple processors with rolling static-order scheduling. We furthermore present a platform-aware analysis model that covers concept and implementation, and integrate the contributions in a toolflow. A proof-of-concept confirms the low overhead of the implementation and the exact timing analysis of our model.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
<td></td>
</tr>
<tr>
<td>16:00</td>
<td>7.7.4</td>
<td>CHRT: A CRITICALITY- AND HETEROGENEITY-AWARE RUNTIME SYSTEM FOR TASK-PARALLEL APPLICATIONS</td>
<td>Myeonggyun Han, UNIST, KR</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Authors:</td>
<td>Myeonggyun Han, Jinsu Park and Woongki Baek, UNIST, KR</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract:</td>
<td>Heterogeneous multiprocessing (HMP) is an emerging technology for high-performance and energy-efficient computing. While task parallelism is widely used in various computing domains from the embedded to machine-learning computing domains, relatively little work has been done to investigate the efficient runtime support that effectively utilizes the criticality of the tasks of the target application and the heterogeneity of the underlying HMP system. To bridge this gap, we propose a criticality- and heterogeneity-aware runtime system for task-parallel applications (CHRT). CHRT dynamically estimates the performance and power consumption of the task-parallel application and robustly manages the full HMP system resources (i.e., core types, counts, and voltage/frequency levels) to maximize the overall efficiency. Our experimental results show that CHRT achieves significantly higher energy efficiency than the baseline runtime system that employs the breadth-first scheduler and the state-of-the-art criticality-aware runtime system.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
<td></td>
</tr>
</tbody>
</table>
MOBIXEN: PORTING XEN ON ANDROID DEVICES FOR MOBILE VIRTUALIZATION
Speaker: Jianguo Yao, Shanghai Jiao Tong University, CN
Authors: Yaozu Dong¹, Jianguo Yao², Haibing Guan², Ananth. Krishna R¹ and Yunhong Jiang¹
¹Intel, US; ²Shanghai Jiao Tong University, CN
Abstract
Mobile virtualization technology provides a feasible way to improve the manageability and security of embedded systems. This paper presents an architecture named MobiXen to address these challenges. In MobiXen, both Xen's physical memory space and virtual address space are shrunk as much as possible, so that Android owns more memory resources; optimizations are developed to reduce the virtualization overhead when Android is accessing system resources; and new policies are implemented to achieve low suspend/resume latency. With this work adopted, MobiXen is customized as a highly efficient mobile hypervisor. The detailed implementation shows that most of the performance degradation brought by MobiXen is less than 3%, which is imperceptible to end users.

OPTIMISATION OPPORTUNITIES AND EVALUATION FOR GPGPU APPLICATIONS ON LOW-END MOBILE GPUs
Speaker: Leonidas Kosmidis, Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES
Authors: Matina Maria Trompouki¹ and Leonidas Kosmidis²
¹Universitat Politècnica de Catalunya, ES; ²Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES
Abstract
Previous works in the literature have shown the feasibility of general purpose computations for non-visual applications on low-end mobile graphics processors using graphics APIs. These works focused only on the functional aspects of the software, ignoring the implementation details and therefore their performance implications due to their particular micro-architecture. Since various steps in such applications can be implemented in multiple ways, we identify optimisation opportunities, explore the different options and evaluate them. We show that the implementation details can significantly affect the obtained performance with discrepancies up to 3 orders of magnitude and we demonstrate the effectiveness of our proposal on two embedded platforms, obtaining more than 16x speedup over benchmarks designed following OpenGL ES 2 best practices.

Coffee Break in Exhibition Area
On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Tuesday, March 28, 2017
  - Coffee Break 10:30 - 11:30
  - Coffee Break 16:00 - 17:00

Wednesday, March 29, 2017
  - Coffee Break 10:00 - 11:00
  - Coffee Break 16:00 - 17:00

Thursday, March 30, 2017
  - Coffee Break 10:00 - 11:00
  - Coffee Break 15:30 - 16:00

7.8 Smart Energy and Self-Powered Devices
Date: Wednesday 29 March 2017
Time: 14:30 - 15:30
Location / Room: Exhibition Theatre
Organiser: Patrick Mayor, EPFL, CH
The goal of this session is to present concrete examples of novel designs for next-generation energy-efficient computing architectures and real-time monitoring and management of smart grids, as well as robust low-power networks of acoustic detectors for natural hazard warning systems.

**IP3 Interactive Presentations**

**Date:** Wednesday 29 March 2017  
**Time:** 16:00 - 16:30  
**Location / Room:** IP sessions (in front of rooms 4A and 5A)

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated with the IP paper is on display throughout the afternoon. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon Interactive Presentations session, the 'Best IP of the Day' award is given.

**IP3-1 LEVERAGING AGING EFFECT TO IMPROVE SRAM-BASED TRUE RANDOM NUMBER GENERATORS**  
**Speaker:** Mohammad Saber Golanbari, Karlsruhe Institute of Technology (KIT), DE  
**Authors:** Saman Kiamehri\(^1\), Mohammad Saber Golanbari\(^2\) and Mehdi Tahoori\(^2\)  
\(^1\)Karlsruhe Institute of Technology (KIT), DE; \(^2\)Karlsruhe Institute of Technology, DE  
**Abstract**  
The start-up value of SRAM cells can be used as the random number vector or a seed for the generation of a pseudo random number. However, the randomness of the generated number is quite low since many of the cells are largely skewed due to process variation and their start-up value leans toward zero or one. In this paper, we propose an approach to increase the randomness of SRAM-based True Random Number Generators (TRNGs) by leveraging transistor aging impact. The idea is to iteratively power up the SRAM cells and put them under accelerated aging to make the cells less skewed and hence obtain a more random vector. The simulation results show that the min-entropy of SRAM-based TRNG increases by 10X using this approach.  
**Download Paper (PDF; Only available from the DATE venue WiFi)**
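
As a rough illustration of the metric cited in the abstract above, the sketch below computes per-bit min-entropy with the standard definition H_min = -log2(max(p, 1 - p)). The per-cell probabilities are hypothetical values chosen for illustration, cells are assumed to be independent, and nothing here reproduces the paper's aging procedure or its data.

```python
import math

def min_entropy_per_bit(p_one: float) -> float:
    """Min-entropy (in bits) of one cell that powers up as 1 with probability p_one."""
    p_max = max(p_one, 1.0 - p_one)
    return -math.log2(p_max)

def average_min_entropy(p_ones) -> float:
    """Average min-entropy per cell, assuming independent cells."""
    return sum(min_entropy_per_bit(p) for p in p_ones) / len(p_ones)

# Hypothetical power-up probabilities (illustrative only, not measured data).
skewed_cells = [0.99, 0.01, 0.98, 0.50]
balanced_cells = [0.60, 0.45, 0.55, 0.50]

print("skewed  :", average_min_entropy(skewed_cells))
print("balanced:", average_min_entropy(balanced_cells))
```

A cell that almost always powers up to the same value contributes close to 0 bits, while an unbiased cell contributes 1 bit, which is why reducing skew raises the min-entropy of the whole vector.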

**IP3-2 DESIGN AUTOMATION FOR OBFUSCATED CIRCUITS WITH MULTIPLE VIABLE FUNCTIONS**  
**Speaker:** Shahrzad Keshavarz, University of Massachusetts Amherst, US  
**Authors:** Shahrzad Keshavarz\(^1\), Christof Paar\(^2\) and Daniel Holcomb\(^1\)  
\(^1\)University of Massachusetts Amherst, US; \(^2\)Horst Görtz Institut for IT-Security, Ruhr-Universität Bochum, DE  
**Abstract**  
Gate camouflaging is a technique for obfuscating the function of a circuit against reverse engineering attacks. However, if an adversary has pre-existing knowledge about the set of functions that are viable for an application, random camouflaging of gates will not obfuscate the function well. In this case, the adversary can target their search, and only needs to decide whether each of the viable functions could be implemented by the circuit. In this work, we propose a method for using camouflaged cells to obfuscate a design that has a known set of viable functions. The circuit produced by this method ensures that an adversary will not be able to rule out any viable functions unless she is able to uncover the gate functions of the camouflaged cells. Our method comprises iterated synthesis within an overall optimization loop to combine the viable functions, followed by technology mapping to deploy camouflaged cells while maintaining the plausibility of all viable functions. We evaluate our technique on cryptographic S-box functions and show that, relative to a baseline approach, it achieves up to 38% area reduction in PRESENT-style S-Boxes and 48% in DES S-Boxes.  
**Download Paper (PDF; Only available from the DATE venue WiFi)**

**IP3-3 DOUBLE MAC: DOUBLING THE PERFORMANCE OF CONVOLUTIONAL NEURAL NETWORKS ON MODERN FPGAS**  
**Speaker:** Jongeun Lee, UNIST, KR  
**Authors:** Dong Nguyen\(^1\), Daewoo Kim\(^1\) and Jongeun Lee\(^2\)  
\(^1\)UNIST, KR; \(^2\)Ulsan National Institute of Science and Technology (UNIST), KR  
**Abstract**  
This paper presents a novel method to double the computation rate of convolutional neural network (CNN) accelerators by packing two multiply-and-accumulate (MAC) operations into one DSP block of off-the-shelf FPGAs (called Double MAC). While a general SIMD MAC using a single DSP block seems impossible, our solution is tailored for the kind of MAC operations required for a convolution layer. Our preliminary evaluation shows that our Double MAC approach not only doubles the computation throughput of a CNN layer with essentially the same resource utilization, but also improves the network-level performance by 14% to 84% over a highly optimized state-of-the-art accelerator solution, depending on the CNN hyper-parameters.  
**Download Paper (PDF; Only available from the DATE venue WiFi)**

A WEAR-LEVELING-AWARE COUNTER MODE FOR DATA ENCRYPTION IN NON-VOLATILE MEMORIES

Speaker: Fangting Huang, Huazhong University of Science and Technology, CN
Authors: Fangting Huang¹, Dan Feng², Yu Hua³ and Wen Zhou²
¹Huazhong University of Science and Technology, CN; ²Wuhan National Lab for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, CN

Abstract
Counter-mode encryption has been widely used to protect NVMs against malicious attacks, due to its proven security and high performance. However, this scheme suffers from the counter size versus re-encryption problem, where per-line counters must be relatively large to avoid counter overflow, or re-encryption of the entire memory is required to ensure security. In order to address this problem, we propose a novel wear-leveling-aware counter mode for data encryption, called Resetting Counter via Remapping (RCR). The basic idea behind RCR is to leverage wear-leveling remappings to reset the line counter. With a carefully designed procedure, RCR avoids counter overflow with a much smaller counter size. The salient features of RCR include low storage overhead of counters, high counter cache hit ratio, and no extra re-encryption overhead. Compared with state-of-the-art works, RCR obtains significant performance improvements, e.g., up to a 57% reduction in the IPC degradation, under the evaluation of 8 memory-intensive benchmarks from SPEC 2006.

Download Paper (PDF; Only available from the DATE venue WiFi)
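
To make the counter-mode idea in the abstract above more concrete, here is a minimal sketch in which each memory line is XORed with a pad derived from the key, the line address and a per-line counter. It is an assumption-laden illustration, not the RCR design: SHA-256 stands in for a block cipher such as AES, and the function names `pad`, `encrypt_line` and `decrypt_line` are hypothetical.

```python
import hashlib

def pad(key: bytes, line_addr: int, counter: int, length: int) -> bytes:
    """Derive a pad from (key, line address, counter).

    SHA-256 is only a stand-in for a block cipher such as AES (assumption);
    it limits this sketch to lines of at most 32 bytes.
    """
    if length > 32:
        raise ValueError("sketch supports at most 32 bytes per line")
    seed = key + line_addr.to_bytes(8, "big") + counter.to_bytes(8, "big")
    return hashlib.sha256(seed).digest()[:length]

def encrypt_line(key: bytes, line_addr: int, counter: int, data: bytes) -> bytes:
    """XOR the data with the pad; applying the same call again decrypts."""
    p = pad(key, line_addr, counter, len(data))
    return bytes(d ^ q for d, q in zip(data, p))

decrypt_line = encrypt_line  # XOR with the same pad is its own inverse

key = b"\x01" * 16
data = b"sixteen byte msg"

c_old = encrypt_line(key, line_addr=5, counter=0, data=data)
c_bumped = encrypt_line(key, line_addr=5, counter=1, data=data)  # counter incremented on a write
c_remapped = encrypt_line(key, line_addr=9, counter=0, data=data)  # line moved, counter reset

print(c_old != c_bumped, c_old != c_remapped)
print(decrypt_line(key, 9, 0, c_remapped) == data)
```

Because the address is part of the pad input, a line that moves to a new physical address gets a different pad even if its counter restarts from zero. A reset is only safe if the (address, counter) pair has never been used with the same key; how RCR guarantees this is described in the paper, not in this sketch.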

A NEW METHOD TO IDENTIFY THRESHOLD LOGIC FUNCTIONS

Speaker: Spyros Tragoudas, Southern Illinois University Carbondale, US

Authors: Seyed Nima Mozaffari, Spyros Tragoudas and Themistoklis Haniotakis, Southern Illinois University, US

Abstract

A new method to identify threshold logic functions is presented. The approach minimizes the transistor count and benefits from a generalized definition of threshold logic functions. Process variations are taken into consideration. Experimental results show that many more functions can be implemented with predetermined hardware overhead, and the hardware requirement of a large percentage of existing threshold functions is reduced.
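
For readers unfamiliar with the term, the sketch below uses the classical definition of a threshold logic function: the output is 1 exactly when a weighted sum of the inputs reaches a threshold. The brute-force search over small integer weights is purely illustrative; it is not the identification method of the paper and does not cover the generalized definition mentioned in the abstract. The helper names are hypothetical.

```python
from itertools import product

def threshold_eval(weights, threshold, inputs):
    """Classical threshold gate: 1 if the weighted input sum reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def realizes(weights, threshold, func):
    """Check all input combinations against the target Boolean function."""
    n = len(weights)
    return all(
        threshold_eval(weights, threshold, bits) == int(func(*bits))
        for bits in product((0, 1), repeat=n)
    )

def find_weights(func, n, max_weight=2):
    """Exhaustively search small integer weights; None if nothing fits in range."""
    values = range(-max_weight, max_weight + 1)
    for weights in product(values, repeat=n):
        for threshold in range(-n * max_weight, n * max_weight + 2):
            if realizes(weights, threshold, func):
                return weights, threshold
    return None

print(find_weights(lambda a, b: a & b, 2))   # AND is a threshold function
print(find_weights(lambda a, b: a ^ b, 2))   # XOR is not, so the search fails
```

The exhaustive search grows exponentially with the number of inputs, so it only serves to make the definition tangible.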
IP3-15  A BRIDGING FAULT MODEL FOR LINE COVERAGE IN THE PRESENCE OF UNDETECTED TRANSITION FAULTS  
Speaker and Author:  
Irith Pomeranz, Purdue University, US  
Abstract  
A variety of fault models have been defined to capture the behaviors of commonly occurring defects and ensure a high quality of testing. When several fault models are used for test generation, it is advantageous if the existence of an undetectable fault in one model does not imply that a fault in the same component but from a different model is also undetectable. This allows a test set to cover the circuit more thoroughly when additional fault models are used. This paper studies the possibility of defining such fault models by considering transition faults as the first fault model, and bridging faults as the second fault model. The bridging faults are defined to cover lines for which transition faults are not detected. A test compaction procedure is developed to demonstrate the bridging fault coverage that can be achieved, and the effect on the number of tests.

Download Paper (PDF; Only available from the DATE venue WiFi)

IP3-16  CHRT: A CRITICALITY- AND HETEROGENEITY-AWARE RUNTIME SYSTEM FOR TASK-PARALLEL APPLICATIONS  
Speaker:  
Myeonggyun Han, UNIST, KR  
Authors:  
Myeonggyun Han, Jinsu Park and Woongki Baek, UNIST, KR  
Abstract  
Heterogeneous multiprocessing (HMP) is an emerging technology for high-performance and energy-efficient computing. While task parallelism is widely used in various computing domains from the embedded to machine-learning computing domains, relatively little work has been done to investigate the efficient runtime support that effectively utilizes the criticality of the tasks of the target application and the heterogeneity of the underlying HMP system with full resource management. To bridge this gap, we propose a criticality- and heterogeneity-aware runtime system for task-parallel applications (CHRT). CHRT dynamically estimates the performance and power consumption of the target task-parallel application and robustly manages the full HMP system resources (i.e., core types, counts, and voltage/frequency levels) to maximize the overall efficiency. Our experimental results show that CHRT achieves significantly higher energy efficiency than the baseline runtime system that employs the breadth-first scheduler and the state-of-the-art criticality-aware runtime system.

Download Paper (PDF; Only available from the DATE venue WiFi)

IP3-17  MOBIXEN: PORTING XEN ON ANDROID DEVICES FOR MOBILE VIRTUALIZATION  
Speaker:  
Jianguo Yao, Shanghai Jiao Tong University, CN  
Authors:  
Yaozu Dong¹, Jianguo Yao², Haibing Guan², Ananth. Krishna R¹ and Yunhong Jiang¹  
¹Intel, US; ²Shanghai Jiao Tong University, CN  
Abstract  
The mobile virtualization technology provides a feasible way to improve the manageability and security for embedded systems. This paper presents an architecture named MobiXen to address these challenges. In MobiXen, both Xen’s physical memory space and virtual address space are shrunk as much as possible so that Android owns more memory resources; optimizations are developed to reduce the virtualization overhead when Android is accessing system resources; and new policies are implemented to achieve low suspend/resume latency. With this work adopted, MobiXen is customized as a highly efficient mobile hypervisor. Detailed implementations show that most of the performance degradation brought by MobiXen is less than 3%, which is imperceptible to end users.

Download Paper (PDF; Only available from the DATE venue WiFi)

IP3-18  OPTIMISATION OPPORTUNITIES AND EVALUATION FOR GPGPU APPLICATIONS ON LOW-END MOBILE GPUs  
Speaker:  
Leonidas Kosmidis, Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES  
Authors:  
Matina Maria Trompouki¹ and Leonidas Kosmidis²  
¹Universitat Politècnica de Catalunya, ES; ²Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES  
Abstract  
Previous works in the literature have shown the feasibility of general purpose computations for non-visual applications on low-end mobile graphics processors using graphics APIs. These works focused only on the functional aspects of the software, ignoring the implementation details and therefore their performance implications due to their particular micro-architecture. Since various steps in such applications can be implemented in multiple ways, we identify optimisation opportunities, explore the different options and evaluate them. We show that the implementation details can significantly affect the obtained performance with discrepancies up to 3 orders of magnitude and we demonstrate the effectiveness of our proposal on two embedded platforms, obtaining more than 16x speedup over benchmarks designed following OpenGL ES 2 best practices.

Download Paper (PDF; Only available from the DATE venue WiFi)

IP7  Ten Cent Chip Challenge - Interactive Presentations  
Date: Wednesday 29 March 2017  
Time: 16:00 - 18:00  
Location / Room: IP session (in front of room SBC)

<table>
<thead>
<tr>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>IP7.1.1</td>
<td>A LOW-POWER IOT PROCESSOR INTEGRATING VOLTAGE-SCALABLE FULLY DIGITAL MEMORIES</td>
<td>Hidetoshi Onodera, Kyoto University, JP</td>
</tr>
<tr>
<td>IP7.1.2</td>
<td>A SIMPLE, STATELESS, COST EFFECTIVE SYMMETRIC CRYPTOGRAPHY STRATEGY FOR ENERGY-HARVESTING IOT DEVICES</td>
<td>Jan Madsen, Technical University of Denmark, DK</td>
</tr>
<tr>
<td>IP7.1.3</td>
<td>RECONFIGURABLE MICROCONTROLLER FOR END NODES IN INTERNET OF THINGS</td>
<td>Wai-Chung Matthew Tang, Queen Mary University of London, GB</td>
</tr>
<tr>
<td>IP7.1.4</td>
<td>FURTHER SIMPLIFICATION OF APPROXIMATE ADDERS USING INPUT DATA RANGES IN IOT</td>
<td>Jeong-A Lee, Chosun University, KR</td>
</tr>
</tbody>
</table>

UB08 Session 8  
Date: Wednesday 29 March 2017  
Time: 16:00 - 18:00  
Location / Room: Booth 1, Exhibition Area

UB08.1 COSSIM: A NOVEL, COMPREHENSIBLE, ULTRA-FAST, SECURITY-AWARE CPS SIMULATOR

Presenter:
Nikolaos Tampouratzis, Technical University of Crete, GR

Authors:
Antonios Nikitakis and Andreas Brokalakis, Synelixis Solutions Ltd, GR

Abstract
One of the main problems Cyber Physical Systems (CPS) and Highly Parallel Systems (HPS) designers face is the lack of simulation tools and models for system design and analysis. This is mainly because the majority of the existing simulation tools can handle efficiently only parts of a system (e.g. only the processing or only the network) while none of them supports the notion of security. Moreover, most of the existing simulators need extreme amounts of processing resources while faster approaches cannot provide the necessary precision and accuracy. COSSIM is an open-source framework that seamlessly simulates, in an integrated way, the networking and the processing parts of the CPS and Highly Parallel Heterogeneous Systems. In addition, COSSIM supports accurate power estimations while it is the first such tool supporting security as a feature of the design process. The complete COSSIM framework together with its sophisticated GUI will be presented.


UB08.2 NETFI-2: AN AUTOMATIC METHOD FOR FAULT INJECTION ON HDL-BASED DESIGNS

Presenter:
Alexandre Coelho, Université Grenoble Alpes, FR

Authors:
Miguel Solinas, Juan Fraire, Nacer-Eddine Zergainoh, Pablo Ferreyra and Raoul Velazco, TIMA, FR

Abstract
Fault injection, which includes fault simulation and emulation, is a well-known technique to evaluate the susceptibility of integrated circuits to the effects of radiation. This work presents a methodology to emulate Single Event Upsets (SEUs) and Single Event Transients (SETs) in a Field Programmable Gate Array (FPGA). The proposed method combines the flexibility of the FPGA with the controllability provided by the MicroBlaze to emulate the HDL circuit and control the fault injection campaign. This approach has been integrated into a fault-injection platform, named NETFI (NETlist Fault Injection), developed by our research group, and received the name of NETFI-2. To validate this methodology, fault injection campaigns have been performed on a Leon3 and a Stochastic Bayesian Machine. Results on an Artix-7 FPGA show that NETFI-2 provides accurate measurements while improving the execution time of the experiment by more than 300% compared with analogous simulation-based campaigns.


UB08.5 ITMD: RUN-TIME MANAGEMENT OF CONCURRENT MULTITHREADED APPLICATIONS ON HETEROGENEOUS MULTI-CORES

Presenter:
Karunakar Reddy Basireddy, University of Southampton, GB

Authors:
Amit Singh, Bashir M. Al-Hashimi and Geoff V. Merrett, University of Southampton, GB

Abstract
Heterogeneous multi-cores often need to deal with multiple applications having different performance requirements concurrently, which generate varying and mixed workloads. Runtime management is required to adapt to such performance requirements and workload variabilities, and to achieve energy efficiency. It is challenging to efficiently exploit different types of cores simultaneously together with the DVFS potential of the cores. We present a run-time management approach that first selects a thread-to-core mapping based on the performance requirements and resource availability. Then, it applies online adaptation by adjusting the voltage-frequency (V-F) levels to achieve energy optimization. We demonstrate the proposed run-time management approach on an Odroid XU4, with various combinations of multi-threaded applications from the PARSEC and SPLASH benchmarks. Results show an improvement in energy efficiency of up to 33% compared to existing approaches.


UB08.6 GNOCS: AN ULTRA-FAST, HIGHLY EXTENSIBLE, CYCLE-ACCURATE GPU-BASED PARALLEL NETWORK-ON-CHIP SIMULATOR

Presenter:
Amir CHARIF, TIMA, FR

Authors:
Nacer-Eddine Zergainoh and Michael Nicolaidis, TIMA, FR

Abstract
With the continuous decrease in feature sizes and the recent emergence of 3D stacking, chips comprising thousands of nodes are becoming increasingly relevant, and state-of-the-art NoC simulators are unable to simulate such a high number of nodes in reasonable times. In this demo, we showcase GNOCs, the first detailed, modular and scalable parallel NoC simulator running fully on GPU (Graphics Processing Unit). Based on a unique design specifically tailored for GPU parallelism, GNOCs is able to achieve unprecedented speedups with no loss of accuracy. To enable quick and easy validation of novel ideas, the programming model was designed with high extensibility in mind. Currently, GNOCs accurately models a VC-based microarchitecture. It supports 2D and 3D mesh topologies with full or partial vertical connections. A variety of routing algorithms and synthetic traffic patterns, as well as dependency-driven trace-based simulation (Netrace), are implemented and will be demonstrated.


UB08.8 SELINK: SECURING HTTP AND HTTPS-BASED COMMUNICATION VIA SEcube™

Presenter:
Airofarulla Giuseppe, CINI & Politecnico di Torino, IT

Authors:
Paolo Prinetto¹ and Antonio Varriale²
¹Politecnico di Torino, IT; ²Blu5 Labs Ltd., IT

Abstract
The SEcube™ Open Source platform is a combination of three main cores in a single-chip design. A low-power ARM Cortex-M4 processor, a flexible and fast Field-Programmable Gate Array (FPGA), and an EAL5+ certified Security Controller (SmartCard) are embedded in an extremely compact package. This makes it a unique Open Source security environment where each function can be optimized, executed, and verified on its proper hardware device. In this demo, we present a client-server HTTP and HTTPS-based application, for which the traffic is encrypted using the built-in hardware capabilities and the software libraries of the SEcube™. By doing so, we show how communication can be secured against an attacker capable of inspecting and tampering with the regular communication.


UB08.9 HEPSYCODE: A SYSTEM-LEVEL METHODOLOGY FOR HW/SW CO-DESIGN OF HETEROGENEOUS PARALLEL DEDICATED SYSTEMS

Presenter:
Luigi Pomante, University of L’Aquila, IT

Authors:
Giacomo Valente¹, Vittoriano Muttillo¹, Daniele Di Pompeo¹, Emilio Incerto² and Daniele Ciambrone¹
¹University of L’Aquila, IT; ²Gran Sasso Science Institute, IT

Abstract
Heterogeneous parallel systems have recently been exploited for a wide range of application domains, for both dedicated (e.g. embedded) and general-purpose products. Such systems can include different processor cores, memories, dedicated ICs and a set of connections between them. They are so complex that the design methodology plays a major role in determining the success of the products. This demo therefore addresses the problem of the electronic system-level HW/SW co-design of heterogeneous parallel dedicated systems. In particular, it shows an enhanced CSP/SystemC-based design space exploration step (and related ESL-EDA prototype tools), in the context of an existing HW/SW co-design flow that, given the system specification and related functional/non-functional (F/NF) requirements, is able to (semi)automatically propose to the designer: a custom heterogeneous parallel architecture; an HW/SW partitioning of the application; and a mapping of the partitioned entities onto the proposed architecture.

8.1 IoT Day Hot Topic Session: Challenges and Potentials for IoT Rollout

Date: Wednesday 29 March 2017
Time: 17:00 - 18:30
Location / Room: SBC

Organisers:
Marilyn Wolf, Georgia Tech, US
Andreas Herkersdorf, TU Muenchen, DE

Chair:
Andreas Herkersdorf, TU Muenchen, DE

Co-Chair:
Marilyn Wolf, Georgia Tech, US

Realizing the potential of IoT will require coordinated advances in multiple markets: applications, software systems, and VLSI. Understanding the requirements on IoT devices requires understanding the stack in which they operate. This session pulls together several points of view on the big picture of IoT rollout and their implications for device and system design.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>17:00</td>
<td>8.1.1</td>
<td>ULTRA-LOW POWER AND DEPENDABILITY FOR IOT DEVICES</td>
<td>Santiago Pagani, KIT Karlsruhe, DE<br>Joerg Henkel¹, Santiago Pagani¹, Hussam Amrouch¹, Lars Bauer¹ and Farzad Samie¹<br>¹Karlsruhe Institute of Technology, DE; ²Karlsruhe Institute of Technology (KIT), DE</td>
</tr>
<tr>
<td>17:30</td>
<td>8.1.2</td>
<td>SMARTER SPACES THROUGH LOCALIZED OBJECT INTERACTIONS</td>
<td>Jean-Marie Bonnin, Telecom Bretagne, FR<br>Jean-Marie Bonnin and Frédéric Weis, Telecom Bretagne, FR</td>
</tr>
<tr>
<td>18:00</td>
<td>8.1.3</td>
<td>DEPLOYING IOT FOR INSTRUMENTATION AND ANALYSIS OF MANUFACTURING SYSTEMS</td>
<td>Sujit Rokka Chhetri, UC Irvine, US<br>Mohammad Al Faruque, University of California Irvine, US</td>
</tr>
</tbody>
</table>

8.2 Hot Topic Session: No Power? No Problem! Exploiting Non-Volatility in Energy Constrained Environments

Date: Wednesday 29 March 2017
Time: 17:00 - 18:30

This talk will present a methodology to collect physical information (e.g., energy flows in acoustic, vibrational or electromagnetic form) effectively and efficiently from a manufacturing system using IoT infrastructure. By applying information-theoretic analysis, we will show how to create a digital twin of the manufacturing system that may be used for process control (i.e., better decision making at different time-scales) and security. We will focus on the plug-and-play capability provided by the IoT, which will allow us to create digital twins of legacy manufacturing systems as well. We will demonstrate our work with an application to additive manufacturing systems (3D printers). We will also present how, in our recent work, we have demonstrated that the confidentiality of a 3D printer can be breached by reconstructing an original 3D model from an analysis of the printer's acoustic emissions.

With the rapid growth of the internet of things (IoT), demands for battery-less systems are ever increasing. Systems that can be powered by ambient energy sources would offer new opportunities and capabilities for personal entertainment. Self-powered computational systems also have obvious societal benefits when deployed for medical monitoring, environmental sensing, etc. This hot topic session considers the current landscape of energy harvesting computing systems and highlights the need for power-neutral systems. Subsequent presentations showcase emerging non-volatile memory and logic technologies that could enable battery-less computing systems.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>17:00</td>
<td>8.2.1</td>
<td>ENERGY-DRIVEN COMPUTING: RETHINKING THE DESIGN OF ENERGY HARVESTING SYSTEMS</td>
<td>Geoff Merrett, University of Southampton, GB</td>
</tr>
<tr>
<td>17:30</td>
<td>8.2.2</td>
<td>NONVOLATILE PROCESSORS: WHY IS IT TRENDING?</td>
<td>Fang Su¹, Kaisheng Ma², Xueqing Li², Tongda Wu¹, Yongpan Liu¹ and Vijaykrishnan Narayanan²</td>
</tr>
<tr>
<td>18:00</td>
<td>8.2.3</td>
<td>ADVANCED SPINTRONIC MEMORY AND LOGIC FOR NON-VOLATILE PROCESSORS</td>
<td>Robert Perricone¹, Ibrahim Ahmed², Zhaoxin Liang², Meghna Mankalale³, X. Sharon Hu¹, Chris H. Kim², Michael Niemier¹, Sachin Sapatnekar² and Jian-Ping Wang²</td>
</tr>
</tbody>
</table>

**End of session**

### 8.3 Secure Processor Components

**Date:** Wednesday 29 March 2017  
**Time:** 17:00 - 18:30  
**Location / Room:** 2BC

**Chair:** Patrick Schaumont, Virginia Tech, US  
**Co-Chair:** Nele Mentens, Katholieke Universiteit Leuven, BE

Security concerns have put significant demands on hardware design of processors. In this session, papers will be presented that describe processor components designed to improve their performance, protect them more efficiently against side channel attacks and thereby improve the overall performance of processors used in secure applications.
17:00  8.3.1  AUTOMATIC GENERATION OF FORMALLY-PROVEN TAMPER-RESISTANT GALOIS-FIELD MULTIPLIERS BASED ON GENERALIZED MASKING SCHEME

Presenter:
Rei Ueno, Tohoku University, JP

Abstract:
In this study, we present a formal design system for tamper-resistant cryptographic hardware based on the Generalized Masking Scheme (GMS). GMS, a state-of-the-art masking-based countermeasure against higher-order differential power analyses (DPAs), can securely construct any kind of Galois-field (GF) arithmetic circuit at the register-transfer-level description, while most other schemes require a specific physical design. We first present a formal design methodology for GMS-based GF-arithmetic circuits based on a hierarchical dataflow graph, called the GF-arithmetic circuit graph (GF-ACG), and a formal verification method for both functionality and security properties based on Gröbner basis. In addition, we propose an automatic generation system for GMS-based GF multipliers, which can synthesize a fifth-order 256-bit multiplier (whose input bit-length is 256 times 7) within 15 min.

Download Paper (PDF; Only available from the DATE venue WiFi)

17:30  8.3.2  SCAM: SECURED CONTENT ADDRESSABLE MEMORY BASED ON HOMOMORPHIC ENCRYPTION

Speaker:
Song Bin, Kyoto University, JP

Abstract:
We propose an implementation of a secured content addressable memory (SCAM) based on homomorphic encryption (HE), where HE is used to compute the word matching function without the processor knowing what is being searched or the result of the matching. By exploiting the shallow logic structure (XNOR followed by AND) of content addressable memory (CAM), we show that SCAM can be implemented with only additive homomorphism, greatly improving the efficiency of the HE algorithm. In the proposed method, the logic of homomorphic XNOR-AND is replaced with homomorphic XOR-OR, requiring only simple additions to be performed on the ciphertext. We also show that our scheme can be implemented by a highly parallelizable and simple hardware architecture. Through experiments, we demonstrate that our software implementation is already 403x faster than the fastest known algorithm. With the help of hardware, we can reduce the energy per word match by a factor of 477 million, making our SCAM scheme much more practical.

Download Paper (PDF; Only available from the DATE venue WiFi)
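
The logic rewrite mentioned in the abstract above (XNOR followed by AND replaced by XOR followed by OR) can be checked on plaintext bits. The sketch below does only that; it performs no encryption. The last function, which detects a match by summing the XOR bits, is an assumption about how an addition-only evaluation could look and is not taken from the paper. All function names are hypothetical.

```python
from itertools import product

def match_xnor_and(word, key):
    """Classic CAM match: every bit pair is equal (XNOR), combined with AND."""
    return all(1 - (w ^ k) for w, k in zip(word, key))

def mismatch_xor_or(word, key):
    """Complementary form: some bit pair differs (XOR), combined with OR."""
    return any(w ^ k for w, k in zip(word, key))

def mismatch_count(word, key):
    """Addition-only variant (assumption): the sum of XOR bits is 0 only on a match."""
    return sum(w ^ k for w, k in zip(word, key))

# Exhaustive check over all 4-bit word/key pairs.
for word in product((0, 1), repeat=4):
    for key in product((0, 1), repeat=4):
        assert match_xnor_and(word, key) == (not mismatch_xor_or(word, key))
        assert match_xnor_and(word, key) == (mismatch_count(word, key) == 0)

print(match_xnor_and((1, 0, 1, 1), (1, 0, 1, 1)), mismatch_count((1, 0, 1, 1), (1, 0, 0, 1)))
```

The equivalence is De Morgan's law applied to the whole word: all bits agree exactly when no bit differs.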

18:00  8.3.3  SPARX - A SIDE-CHANNEL PROTECTED PROCESSOR FOR ARX-BASED CRYPTOGRAPHY

Speaker:
Florian Bache, University of Bremen, DE

Abstract:
ARX-based cryptographic algorithms are composed of only three elemental operations --- addition, rotation and exclusive or --- which are mixed to ensure adequate confusion and diffusion properties. While ARX-ciphers can easily be protected against timing attacks, special measures like masking have to be taken in order to prevent power and electromagnetic analysis. In this paper, we present a processor architecture for ARX-based cryptography that intrinsically guarantees first-order SCA resistance of any implemented algorithm. This is achieved by protecting the complete data path using a Boolean masking scheme with three shares. We evaluate our security claims by mapping an ARX-algorithm to the proposed architecture and using the common leakage detection methodology based on Student's t-test to certify the side-channel resistance of our processor.

Download Paper (PDF; Only available from the DATE venue WiFi)
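
As a small illustration of the three-share Boolean masking named in the abstract above, the sketch below splits a 32-bit word into three shares whose XOR equals the secret and applies XOR and rotation share by share. The word width and the helper names are assumptions made for this example. Modular addition is not linear over XOR and needs a dedicated secure construction, which is deliberately not shown, and nothing here reflects the actual SPARX data path.

```python
import secrets

WIDTH = 32                      # assumed word size for this sketch
MASK = (1 << WIDTH) - 1

def share(x: int):
    """Split x into three shares with x == s1 ^ s2 ^ s3."""
    s1 = secrets.randbits(WIDTH)
    s2 = secrets.randbits(WIDTH)
    return (s1, s2, (x ^ s1 ^ s2) & MASK)

def unshare(shares) -> int:
    """Recombine the shares into the secret value."""
    s1, s2, s3 = shares
    return s1 ^ s2 ^ s3

def rotl(x: int, r: int) -> int:
    """Rotate a WIDTH-bit word left by r bits."""
    r %= WIDTH
    return ((x << r) | (x >> (WIDTH - r))) & MASK

def masked_xor(a, b):
    """XOR is linear, so it can be applied to each pair of shares independently."""
    return tuple(x ^ y for x, y in zip(a, b))

def masked_rotl(a, r: int):
    """Rotation is linear as well, so each share is rotated on its own."""
    return tuple(rotl(x, r) for x in a)

x, y = 0x12345678, 0x9ABCDEF0
print(unshare(masked_xor(share(x), share(y))) == x ^ y)
print(unshare(masked_rotl(share(x), 7)) == rotl(x, 7))
```

Since the secret never appears in a single share, the linear ARX operations can be computed without recombining it at any point.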

18:30  End of session

8.4 Advanced systems for healthcare and assistive technologies

Date: Wednesday 29 March 2017
Time: 17:00 - 18:30
Location / Room: 3A
Chair: Ruben Braojos, EPFL, CH
Co-Chair: Luca Fanucci, University of Pisa, IT

This session focuses on embedded systems for human activity recognition and control. These systems combine flexible and dynamic hardware architectures with advanced novel signal processing techniques for activity recognition, myoelectric prosthesis control, motor intention decoding and brain computer interface. Finally, we will have two interactive presentations focused on embedded systems for diagnosis.

17:00  8.4.1  ADAPTIVE COMPRESSED SENSING AT THE FINGER TIP OF INTERNET-OF-THINGS SENSORS: AN ULTRA-LOW POWER ACTIVITY RECOGNITION

Presenter:
Ramin Fallahzadeh, Washington State University, US

Abstract:
With the proliferation of wearable devices in Internet-of-Things applications, there is a growing need for highly power-efficient solutions that allow these technologies to operate continuously in life-critical settings. We propose a novel ultra-low power framework for adaptive compressed sensing in activity recognition. The proposed design uses a coarse-grained activity recognition module to adaptively tune the compressed sensing module for minimized sensing/transmission costs. We pose an optimization problem to minimize activity-specific sensing rates and introduce a polynomial-time approximation algorithm using a novel heuristic dynamic optimization tree. Our evaluations on real-world data show that the proposed autonomous framework is capable of generating feedback with >80% confidence and improves on the power reduction performance of the state-of-the-art approach by a factor of two.

Download Paper (PDF; Only available from the DATE venue WiFi)

17:30 8.4.2 A ZYNQ-BASED DYNAMICALLY RECONFIGURABLE HIGH DENSITY MYOELECTRIC PROSTHESIS CONTROLLER
Speaker: Linus Witschen, Paderborn University, DE
Authors: Alexander Boschmann1, Georg Thormansen1, Linus Witschen1, Alex Wiens1 and Marco Platzner2
1Paderborn University, DE; 2University of Paderborn, DE
Abstract
The combination of high-density electromyographic (HD EMG) sensor technology and modern machine learning algorithms allows for intuitive and robust prostheses control of multiple degrees of freedom. However, HD EMG real-time processing poses a challenge for common microprocessors in an embedded system. With the goal set on an autonomous prosthetic capable of performing training and classification of an amputee's HD EMG signals, the focus of this paper lies in the acceleration of the computationally expensive parts of the embedded signal processing chain: the feature extraction and classification. Using the Xilinx Zynq as a low-cost off-the-shelf system, we present a solution capable of processing 192 HD EMG channels with control delays below 120 milliseconds, suitable for highly responsive real-world prostheses control, achieving speed-ups up to 2.8 as compared to a software-only solution. Using dynamic FPGA reconfiguration, the system is able to trade off increased controller delay against improved classification accuracy when signal quality is decreased due to noisy channels. Offloading feature extraction and classification to the FPGA also reduced the system's power consumption, making it more suitable to be used in a battery-powered setup. The system was validated using real-time experiments with online HD EMG data from an amputee to control a state-of-the-art prosthesis.
Download Paper (PDF; Only available from the DATE venue WiFi)

18:00 8.4.3 MICROWATT END-TO-END DIGITAL NEURAL SIGNAL PROCESSING SYSTEMS FOR MOTOR INTENTION DECODING
Speaker: Zhewei Jiang, Columbia University, US
Authors: Zhewei Jiang1, Chisung Bae2, Joongsong Kang1, Sang Joon Kim2 and Mingoo Seok1
1Columbia University, US; 2Samsung Electronics, KR
Abstract
This paper presents microwatt end-to-end digital signal processing (DSP) systems for deployment-stage real-time upper-limb movement intent prediction. These brain computer interface (BCI) DSP systems feature intercellular spike detection, sorting, and decoding operations for a 96-channel prosthetic implant. We design the algorithms for those operations to achieve minimal computational complexity while matching or advancing the accuracy of state-of-the-art BCI sorting and movement decoding. Based on those algorithms, we architect the DSP hardware with a focus on hardware reuse and event-driven operation. The VLSI implementation of the proposed architecture in a 65-nm high-VTH process shows that it can achieve 7.7μW at a supply voltage of 300mV in post-layout simulation. The area is 0.16 mm².
Download Paper (PDF; Only available from the DATE venue WiFi)

18:15 8.4.4 AN EMBEDDED SYSTEM REMOTELY DRIVING MECHANICAL DEVICES BY P300 BRAIN ACTIVITY
Speaker: Daniela De Venuto, Politecnico di Bari, IT
Authors: Valerio F. Annese1, Giovanni Mezzina2 and Daniela De Venuto2
1Politecnico di Bari, IT; 2Dept. of Electrical and Information Engineering, Politecnico di Bari, IT
Abstract
In this paper we present a P300-based Brain Computer Interface (BCI) for the remote control of a mechatronic actuator, such as a wheelchair or even a car, driven by EEG signals. It is intended for tetraplegic and paralytic users, or simply for safer driving in the case of a car. The P300 signal, an Event-Related Potential (ERP) associated with cognitive brain activity, is induced on purpose by visual stimulation. The EEG data are collected by 6 smart wireless electrodes from the parietal-cortex area and classified online by a linear threshold classifier, based on a suitable Machine Learning (ML) stage. The ML is implemented on a µPC dedicated to the system, where data acquisition and processing are also performed. The main improvement in remotely driving a car by EEG concerns the approach used for intention recognition. In this work, the classification is based on the P300 and not just on the average of several poorly identified potentials. This approach reduces the number of electrodes on the EEG helmet. The ML stage is based on a custom algorithm (i-RIDE) which tunes the subsequent classification stage to the user's “cognitive chronometry”. The ML algorithm starts with a fast calibration phase (just ~190s for the first learning). Furthermore, the BCI presents a functional approach for time-domain feature extraction, which reduces the amount of data to be analyzed and hence the system response times. In this paper, a proof of concept of the proposed BCI is shown using a prototype car, tested on 5 subjects (aged 26 ± 3). The experimental results show that the novel ML approach allows a complete P300 spatio-temporal characterization in 1.95s using 38 target brain visual stimuli (for each direction of the car path). In free-drive mode, the BCI classification reaches 80.5 ± 4.1% single-trial detection accuracy, while the worst-case computational time is 19.65ms ± 10.1. The BCI system described here can also be used with different mechatronic actuators, such as robots.
Download Paper (PDF; Only available from the DATE venue WiFi)

18:31 IP4-1, 91 1024-CHANNEL 3D ULTRASOUND DIGITAL BEAMFORMER IN A SINGLE 5W FPGA
Speaker: Aya Ibrahim, EPFL, CH
Authors: Federico Angiolini1, Aya Ibrahim1, William Simon1, Ahmet Caner Yüzgüller1, Marcel Arditi1, Jean-Philippe Thiran1 and Giovanni De Micheli2
1EPFL, CH; 2École Polytechnique Fédérale de Lausanne (EPFL), CH
Abstract
3D ultrasound, an emerging medical imaging technique that is presently only used in hospitals, has the potential to enable breakthrough telemedicine applications, provided that its cost and power dissipation can be minimized. In this paper, we present an FPGA architecture suitable for a portable medical 3D ultrasound device. We describe an optimized design for the digital part of the imager, including the delay calculation block, which is its most critical part. Our computationally efficient approach requires a single FPGA for 3D imaging, which is unprecedented. The design is scalable; a configuration supporting a 32x32-channel probe, which enables high-quality imaging, consumes only about 5W.
Download Paper (PDF; Only available from the DATE venue WiFi)

8.5 Learning and Resilience Techniques for Green Computing
Date: Wednesday 29 March 2017
Time: 17:00 - 18:30
Location / Room: 3C
Chair: Muhammed Shafique, Vienna University of Technology (TU-Wien), AT
Co-Chair: Andreas Burg, EPFL, CH
The papers in this session discuss the use of learning as well as energy efficient circuit level implementation techniques for Neural Networks and for Green Computing in general.
17:00 8.5.1 REVAMPING TIMING ERROR RESILIENCE TO TACKLE CHOKE POINTS AT NTC SYSTEMS
Speaker: Aatreyi Bal, USU Bridge Lab, Utah State University, US
Authors: Aatreyi Bal, Shamik Saha, Sanghamitra Roy and Koushik Chakraborty, Utah State University, US
Abstract
In this paper, we illustrate “choke points” as a vital consequence of process variation in the Near Threshold Computing (NTC) domain. Choke points are sensitized logic gates with increased delay deviation, due to process variation.
Download Paper (PDF; Only available from the DATE venue WiFi)

17:30 8.5.2 EFFICIENT NEURAL NETWORK ACCELERATION ON GPGPU USING CONTENT ADDRESSABLE MEMORY
Speaker: Tajana Rosing, University of California at San Diego, US
Authors: Mohsen Imani1, Daniel Peroni1, Yeseong Kim1, Abbas Rahimi2 and Tajana Rosing3
1University of California San Diego, US; 2University of California Berkeley, US; 3UCSD, US
Abstract
Recently, neural networks have been demonstrated to be effective models for image processing, video segmentation, speech recognition, computer vision and gaming. However, high computation energy and low performance are the primary bottlenecks of running neural networks. In this paper, we propose an energy/performance-efficient network acceleration technique on General Purpose GPU (GPGPU) architecture which utilizes specialized resistive nearest content addressable memory blocks, called NNCAM, by exploiting computation locality of the learning algorithms. NNCAM stores high-frequency patterns corresponding to neural network operations and searches for the most similar patterns to reuse the computation results. To improve NNCAM computation efficiency and accuracy, we propose layer-based associative update and selective approximation techniques. The layer-based update improves data locality of NNCAM blocks by filling NNCAM values based on the frequent computation patterns of each neural network layer. To guarantee the appropriate level of computation accuracy while providing maximum energy saving, our design adaptively allocates the neural network operations to either NNCAM or GPGPU floating point units (FPUs). The selective approximation relaxes computation on neural network layers by considering the impact on accuracy. In evaluation, we integrate NNCAM blocks with the modern AMD Southern Islands GPU architecture. Our experimental evaluation shows that the enhanced GPGPU can result in 68% energy savings and 40% speedup running on four popular convolutional neural networks (CNN), ensuring acceptable <2% quality loss.
Download Paper (PDF; Only available from the DATE venue WiFi)

18:00 8.5.3 CHAIN-NN: AN ENERGY-EFFICIENT 1D CHAIN ARCHITECTURE FOR ACCELERATING DEEP CONVOLUTIONAL NEURAL NETWORKS
Speaker: Shihao Wang, Waseda University, JP
Authors: Shihao Wang, Dajiang Zhou, Xusen Han and Yoshimura Takeshi, Waseda University, JP
Abstract
Deep convolutional neural networks (CNN) have shown good performance in many computer vision tasks. However, the high computational complexity of CNNs involves a huge amount of data flow. The computational cores and the memory hierarchy account for the majority of the power consumption. This paper presents Chain-NN, a novel energy-efficient 1D chain architecture for accelerating deep CNNs. Chain-NN consists of dedicated dual-channel process engines (PE). In Chain-NN, convolutions are done by 1D systolic primitives composed of a group of adjacent PEs. These systolic primitives, together with the proposed column-wise scan input pattern, can fully reuse input operands to reduce the memory bandwidth requirement for energy saving. Moreover, the 1D chain architecture allows the systolic primitives to be easily reconfigured according to specific CNN parameters with less design complexity. The synthesis and layout of Chain-NN are done in a TSMC 28nm process. It costs 3751k logic gates and 352KB of on-chip memory. The results show that a 576-PE Chain-NN can be scaled up to 700MHz. This achieves a peak throughput of 806.4GOPS at 567.5mW and is able to accelerate the five convolutional layers in AlexNet at a frame rate of 362.2fps. The resulting power efficiency of 1421.0GOPS/W is 2.5x to 4.1x better than state-of-the-art works.
Download Paper (PDF; Only available from the DATE venue WiFi)
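
The peak figures quoted in the abstract above can be cross-checked with simple arithmetic. The sketch below assumes that each PE completes one multiply-accumulate per cycle and that this is counted as two operations; that counting convention is an assumption of this example, not something stated in the abstract. The function names are hypothetical.

```python
def peak_gops(num_pes, freq_hz, ops_per_pe_per_cycle=2):
    """Peak throughput in GOPS, assuming every PE is busy every cycle."""
    return num_pes * ops_per_pe_per_cycle * freq_hz / 1e9


def gops_per_watt(gops, power_w):
    """Power efficiency in GOPS/W."""
    return gops / power_w


throughput = peak_gops(num_pes=576, freq_hz=700e6)
efficiency = gops_per_watt(throughput, power_w=0.5675)
```

Under that assumption, 576 PEs at 700MHz give 806.4GOPS, and dividing by 567.5mW gives roughly 1421GOPS/W, consistent with the reported numbers.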

18:15 8.5.4 CONTINUOUS LEARNING OF HPC INFRASTRUCTURE MODELS USING BIG DATA ANALYTICS AND IN-MEMORY PROCESSING TOOLS
Speaker: Francesco Beneventi, Università di Bologna, IT
Authors: Francesco Beneventi1, Andrea Bartolini2, Carlo Cavazzoni3 and Luca Benini2
1DEI - University of Bologna, IT; 2Università di Bologna, IT; 3Cineca, IT
Abstract
Exascale computing represents the next leap in the HPC race. Reaching this level of performance is subject to several engineering challenges such as energy consumption, equipment cooling, reliability and massive parallelism. Model-based optimization is an essential tool in the design process and control of energy-efficient, reliable and thermally constrained systems. However, in the Exascale domain, model learning techniques tailored to the specific supercomputer require real measurements and must therefore handle and analyze a massive amount of data coming from the HPC monitoring infrastructure. This rapidly becomes a big-data-scale problem. The common approach, where measurements are first stored in large databases and then processed, is no longer affordable due to increasing storage costs and the lack of real-time support. Nowadays, cloud-based machine learning techniques instead aim to build on-line models using real-time approaches such as “stream processing” and “in-memory” computing, which avoid storage costs and enable fast data processing. Moreover, the fast delivery and adaptation of the models to quick data variations make the decision stage of the optimization loop more effective and reliable. In this paper we leverage scalable, lightweight and flexible IoT technologies, such as the MQTT protocol, to build a highly scalable HPC monitoring infrastructure able to handle the massive sensor data produced by next-gen HPC components. We then show how state-of-the-art tools for big data computing and analysis, such as Apache Spark, can be used to manage the huge amount of data delivered by the monitoring layer and to build adaptive models in real time using on-line machine learning techniques.
Download Paper (PDF; Only available from the DATE venue WiFi)

18:30 463 LAANT: A LIBRARY TO AUTOMATICALLY OPTIMIZE EDP FOR OPENMP APPLICATIONS
Speaker: Arthur Francisco Lorenzon, Federal University of Rio Grande do Sul, BR
Authors: Arthur Lorenzon, Jackson Dellagostin Souza and Antonio Carlos Schneider Beck Filho, Universidade Federal do Rio Grande do Sul, BR
Abstract
Efficiently exploiting the gains of thread-level parallelism is the main challenge for software developers. While blindly increasing the number of threads may lead to performance gains, it can also result in a disproportionate increase in energy consumption. For this reason, choosing the right number of threads is essential to reach the best compromise between both. However, such a task is extremely difficult: besides the huge number of variables involved, many of them change according to different aspects of the system at hand and can only be determined at run-time. To address this complex scenario, we propose LAANT, a novel library to automatically find the optimal number of threads for OpenMP applications, by dynamically considering their particular characteristics, input set, and the processor architecture. By executing nine well-known benchmarks on three real multicore processors, LAANT improves the EDP (Energy-Delay Product) by up to 61%, compared to the standard OpenMP execution; and by 44%, when the dynamic adjustment of the number of threads of OpenMP is activated.
Download Paper (PDF; Only available from the DATE venue WiFi)
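
The Energy-Delay Product used as the metric in the abstract above is the product of the energy consumed and the execution time, so a configuration is preferred when it lowers that product. The following sketch picks the thread count with the lowest EDP from a table of measurements. The exhaustive selection and the sample numbers are purely illustrative assumptions; they are not LAANT's search strategy or its results.

```python
def edp(energy_j, time_s):
    """Energy-Delay Product: energy (J) multiplied by execution time (s)."""
    return energy_j * time_s


def best_thread_count(measurements):
    """Pick the thread count with the lowest EDP.

    measurements maps a thread count to an (energy_j, time_s) pair.
    """
    return min(measurements, key=lambda n: edp(*measurements[n]))


# Hypothetical measurements: more threads run faster but draw more energy.
samples = {
    1: (100.0, 10.0),
    2: (110.0, 5.5),
    4: (140.0, 3.0),
    8: (220.0, 2.5),
}
chosen = best_thread_count(samples)
```

With these made-up numbers the fastest configuration (8 threads) is not the one with the best EDP, which is the trade-off the abstract describes.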
In healthcare, effective monitoring of patients plays a key role in detecting health deterioration early enough. Many signs of deterioration exist as early as 24 hours before they have a serious impact on the health of a person. As hospitalization times have to be minimized, in-home or remote early warning systems can fill the gap by allowing in-home care while keeping potentially problematic conditions and their signs under surveillance and control. This work presents a remote monitoring and diagnostic system that provides a holistic perspective of patients and their health conditions. We discuss how the concept of self-awareness can be used in various parts of the system such as information collection through wearable sensors, confidence assessment of the sensory data, the knowledge base of the patient’s health situation, and automation of reasoning about the health situation. Our approach to self-awareness provides (i) situation awareness to consider the impact of variations such as sleeping, walking, running, and resting, (ii) system personalization by reflecting parameters such as age, body mass index, and gender, and (iii) the attention property of self-awareness to improve the energy efficiency and dependability of the system by adjusting the priorities of the sensory data collection. We evaluate the proposed method using a full system demonstration.

Download Paper (PDF; Only available from the DATE venue WiFi)

### 8.7 Instruction-level and thread-level parallelism in embedded systems

**Date:** Wednesday 29 March 2017  
**Time:** 17:00 - 18:30  
**Location / Room:** 3B

**Chair:**  
Oliver Bringmann, Universität Tübingen, DE

**Co-Chair:**  
Jurgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE

The first paper in this session presents a novel open-source hardware/software infrastructure for dynamic binary translation. The second paper presents a mechanism to improve the instruction-level and thread-level parallelism in embedded systems. The third paper presents a WCET analysis for multiple tasks on single-core systems.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>17:00</td>
<td>8.7.1</td>
<td>HARDWARE-ACCELERATED DYNAMIC BINARY TRANSLATION</td>
<td>Simon Rokicki, Université de Rennes 1 / IRISA, FR<br>Simon Rokicki<sup>1</sup>, Erven Rohou<sup>2</sup> and Steven Derrien<sup>1</sup><br><sup>1</sup>IRISA, FR; <sup>2</sup>IRISA, FR</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Dynamic Binary Translation (DBT) is often used in hardware/software co-design to take advantage of an architecture model while using binaries from another one. The co-development of the DBT engine and of the execution architecture leads to architectures with special support for these mechanisms. In this work, we propose a hardware-accelerated Dynamic Binary Translation where the first steps of the DBT process are fully accelerated in hardware. Results show that using our hardware accelerators leads to a speed-up of 8x and an energy cost 18x lower, compared with an equivalent software approach.<br>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
<td></td>
</tr>
<tr>
<td>17:30</td>
<td>8.7.2</td>
<td>SUPERWORD LEVEL PARALLELISM AWARE WORD LENGTH OPTIMIZATION</td>
<td>Ali Hassan El Moussawi, IRISA, FR<br>Ali Hassan El Moussawi<sup>1</sup> and Steven Derrien<sup>2</sup><br><sup>1</sup>IRISA, FR; <sup>2</sup>IRISA, FR</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Many embedded processors do not support floating-point arithmetic in order to comply with strict cost and power consumption constraints. However, they generally provide support for SIMD as a means to improve performance for little cost overhead. Achieving good performance when targeting such processors requires the use of fixed-point arithmetic and efficient exploitation of the SIMD data-path. To reduce time-to-market, automatic SIMDization -- such as superword level parallelism (SLP) extraction -- and floating-point to fixed-point conversion methodologies have been proposed. In this paper we show that applying these transformations independently is not efficient. We propose an SLP-aware word length optimization algorithm to jointly perform float-to-fixed-point conversion and SLP extraction. We implement the proposed approach in a source-to-source compiler framework and evaluate it on several embedded processors. Experimental results illustrate the validity of our approach.<br>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
<td></td>
</tr>
<tr>
<td>18:00</td>
<td>8.7.3</td>
<td>SCHEDULABILITY-AWARE SPM ALLOCATION FOR PREEMPTIVE HARD REAL-TIME SYSTEMS WITH ARBITRARY ACTIVATION PATTERNS</td>
<td>Arno Luppold, Hamburg University of Technology, DE<br>Arno Luppold<sup>1</sup> and Heiko Falk<sup>2</sup><br><sup>1</sup>Hamburg University of Technology, DE; <sup>2</sup>Hamburg University of Technology (TUHH), DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>In hard real-time multi-tasking systems each task has to meet its deadline under any circumstances. If one or several tasks violate their timing constraints, compiler optimizations can be used to optimize the Worst-Case Execution Time (WCET) of each task with a focus on the system’s schedulability. Existing approaches are limited to single-tasking or strictly periodic multi-tasking systems. We propose a compiler optimization to perform a schedulability-aware static instruction Scratchpad Allocation for arbitrary activation patterns and deadlines. The approach is based on Integer-Linear Programming and is evaluated for the Infineon TriCore TC1796 microcontroller.<br>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
<td></td>
</tr>
<tr>
<td>18:30</td>
<td>8.7.4</td>
<td>SCHEDULE-AWARE LOOP PARALLELIZATION FOR EMBEDDED MPSoCs BY EXPLOITING PARALLEL SLACK</td>
<td>Miguel Angel Aguilar, RWTH Aachen University, DE<br>Miguel Angel Aguilar<sup>1</sup>, Rainer Leupers<sup>1</sup>, Gerd Ascheid<sup>1</sup>, Nikolaos Kavvadias<sup>2</sup> and Liam Fitzpatrick<sup>2</sup><br><sup>1</sup>RWTH Aachen University, DE; <sup>2</sup>Silexica Software Solutions GmbH, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>MPSoC programming is still a challenging task, where several aspects have to be taken into account to achieve a profitable parallel execution. Selecting a proper scheduling policy is an aspect that has a major impact on performance. OpenMP is an example of a programming paradigm that allows the scheduling policy to be specified on a per-loop basis. However, choosing the best scheduling policy and the corresponding parameters is not a trivial task. In fact, there is already a large amount of software parallelized with OpenMP where the scheduling policy is not explicitly specified. The scheduling decision is then left to the default runtime, which in most cases does not yield the best performance. In this paper, we present a schedule-aware optimization approach enabled by exploiting the parallel slack existing in loops parallelized with OpenMP. Results on an embedded multicore device show that OpenMP loops optimized with our approach outperform, by up to 93%, the original OpenMP loops, where the scheduling policy is not specified.<br>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
<td></td>
</tr>
</tbody>
</table>
[...] amenities. Thus, it is one of the main networking opportunities during the DATE week.

The party is scheduled on March 29, 2017, from 19:00 to 23:00, and will take place in Lausanne's most outstanding museum location: The Olympic Museum. It is beautifully located [...] exhibition space and a new scenography which perfectly reflects the idea and spirit behind and how rich and diverse Olympism is. Some of the themes highlighted include sports, history, culture, design, sociology, and technology.

During the evening, all delegates will have the chance to visit the different expositions for free.

Please kindly note that it is not a seated dinner. Drinks and snacks (flying buffet) will be served in the TOM Café.

[...] be booked during the online registration process though). Additional tickets can be purchased on-site at the registration desk (subject to availability of tickets). Price for extra ticket: CHF 80.00 per person.

Technology entrepreneurship involves taking a technology idea and finding a high-potential commercial opportunity, gathering resources such as talent and capital, considering how to market the idea, and managing rapid growth. It is a very high-potential path with a chance of both high earnings and large direct impact. However, it is also a really difficult path, and only a small number of people are successful.

Success in this kind of business requires strong technical skills, the capacity to deal with a high risk of failure, and extremely hard work. In this panel we will discuss the challenges, opportunities and risks of creating technology startups.

End of session
### 9.1 Wearable and Smart Medical Devices Day: New tools and devices for chronic and acute care

**Date:** Thursday 30 March 2017  
**Time:** 08:30 - 10:00  
**Location / Room:** 5BC

**Organisers:**  
José L. Ayala, Universidad Complutense de Madrid, ES  
Chris Van Hoof, IMEC, BE

**Chair:**  
José L. Ayala, Universidad Complutense de Madrid, ES

**Co-Chair:**  
Mario Konijnenburg, IMEC, BE

This session will present recent advances in medical devices for clinical practice. We will see how industry and academia are designing novel wearables, ASICs and computational systems that help promote novel healthcare paradigms in the treatment of chronic and acute diseases.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>08:30</td>
<td>9.1.1</td>
<td>WEARABLE ROBOTICS IN CLINICAL PRACTICE: PROSPECTS</td>
<td>José Luis Pons, CSIC, ES</td>
</tr>
<tr>
<td>09:00</td>
<td>9.1.2</td>
<td>OVERCOMING HEARING LOSS THROUGH NEW IMPLANT TECHNOLOGIES</td>
<td>Carl Van Himbeeck, Cochlear Technology Centre, BE</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Hearing loss is a big unmet medical need. There is a significant and growing group of people with significant hearing loss who could benefit from implant technologies. A broad range of implant and clinical solutions are developed to improve the access to the users and the professionals.</td>
<td></td>
</tr>
<tr>
<td>09:30</td>
<td>9.1.3</td>
<td>CIRCUITS AND SYSTEMS AS ENABLERS FOR NOVEL HEALTHCARE PARADIGMS</td>
<td>Mario Konijnenburg, imec, BE</td>
</tr>
</tbody>
</table>

10:00  
End of session  
Coffee Break in Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

- Tuesday, March 28, 2017  
  - Coffee Break 10:30 - 11:30  
  - Coffee Break 16:00 - 17:00

- Wednesday, March 29, 2017  
  - Coffee Break 10:00 - 11:00  
  - Coffee Break 16:00 - 17:00

- Thursday, March 30, 2017  
  - Coffee Break 10:00 - 11:00  
  - Coffee Break 15:30 - 16:00

### 9.2 Emerging Schemes for Memory Management

**Date:** Thursday 30 March 2017  
**Time:** 08:30 - 10:00  
**Location / Room:** 4BC

**Chair:**  
Arne Heittman, RWTH, DE

**Co-Chair:**  
Costin Anghel, ISEP, FR

This topic covers aspects of emerging memory architectures and functional blocks with respect to performance and endurance enhancement. In particular, caches, FTL, logic-in-memory and error correction schemes are covered, with strategies such as error correction, wear leveling and cache replacement. NVMs such as PCM, Flash and RRAM are considered in this track.

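<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>08:30</td>
<td></td>
<td>A LOG-AWARE SYNERGIZED SCHEME FOR PAGE-LEVEL FTL DESIGN</td>
<td>Chu Li, Huazhong University of Science & Technology, CN</td>
</tr>
<tr>
<td>09:00</td>
<td></td>
<td>MALRU: MISS-PENALTY AWARE LRU-BASED CACHE REPLACEMENT FOR HYBRID MEMORY SYSTEMS</td>
<td>Chen Di, Huazhong University of Science and Technology, CN</td>
</tr>
<tr>
<td>09:30</td>
<td></td>
<td>ENDURANCE MANAGEMENT FOR RESISTIVE LOGIC-IN-MEMORY COMPUTING ARCHITECTURES</td>
<td>Saeideh Shirinzadeh, University of Bremen, DE</td>
</tr>
</tbody>
</table>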

10:00  
End of session  
A LOG-AWARE SYNERGIZED SCHEME FOR PAGE-LEVEL FTL DESIGN

Authors:
Chu Li, Huazhong University of Science & Technology, CN

Abstract
NAND flash-based Solid State Drives (SSDs) employ the Flash Translation Layer (FTL) to perform logical-to-physical address translation. Modern page-level FTLs selectively cache the address mappings in the limited SRAM while storing the mapping table in flash pages (called translation pages). However, many extra accesses to the translation pages are required for address translation, which decreases the performance and lifetime of an SSD. In this paper, we propose a Log-aware Synergized scheme for page-level FTL, called LSFTL, to reduce these extra overheads. The contribution of LSFTL consists of two key elements: (i) To address garbage collection overhead, LSFTL reserves a small portion of each translation page as a logging area that holds multiple updates to the entries of that translation page. (ii) "Log-aware flash back" reduces the number of translation page updates by evicting multiple dirty cache lines that share the same translation page in a single transaction. Extensive experimental results of trace-driven simulations show that LSFTL decreases the system response time by 39.40% on average, and up to 58.35%, and reduces the block erase count by 37.55% on average, and up to 39.99%, compared to the well-known DFTL.
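
The second element described above, evicting together the dirty cache lines that share a translation page, can be pictured as a grouping step. The sketch below is a simplified illustration only: it assumes that consecutive logical page numbers (LPNs) map to the same translation page, so that the page index is `lpn // entries_per_page`, and the function name and parameters are hypothetical. It does not model LSFTL's logging area or its actual eviction policy.

```python
from collections import defaultdict


def group_dirty_by_translation_page(dirty_lpns, entries_per_page):
    """Group dirty mapping entries by the translation page that holds them.

    Assumes translation page k stores the mappings for logical pages
    k * entries_per_page up to (k + 1) * entries_per_page - 1.
    """
    groups = defaultdict(list)
    for lpn in dirty_lpns:
        groups[lpn // entries_per_page].append(lpn)
    return dict(groups)


# Five dirty entries fall into three translation pages, so batching
# needs three translation page updates instead of five.
batches = group_dirty_by_translation_page([3, 1027, 5, 2050, 1030], 1024)
```

Writing back each group in one transaction is what reduces the number of translation page updates in this simplified picture.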

MALRU: MISS-PENALTY AWARE LRU-BASED CACHE REPLACEMENT FOR HYBRID MEMORY SYSTEMS

Authors:
Chen Di, Huazhong University of Science and Technology, CN

Abstract
Current DRAM based memory systems face the scalability challenges in terms of storage density, power, and cost. Hybrid memory architecture composed of emerging Non-Volatile Memory (NVM) and DRAM is a promising approach to large-capacity and energy-efficient main memory. However, hybrid memory systems pose a new challenge to on-chip cache management due to the asymmetrical penalty of memory access to DRAM and NVM in case of cache misses. Cache hit rate is no longer an effective metric for evaluating memory access performance in hybrid memory systems. Current cache replacement policies that aim to improve cache hit rate are not efficient either. In this paper, we take into account the asymmetry of cache miss penalty on DRAM and NVM, and advocate a more general metric, Average Memory Access Time (AMAT), to evaluate the performance of hybrid memories. We propose a miss penalty-aware LRU-based (MALRU) cache replacement policy for hybrid memory systems. MALRU is aware of the source (DRAM or NVM) of missing blocks and prevents high-latency NVM blocks as well as low-latency DRAM blocks with good temporal locality from being evicted. Experimental results show that MALRU improves system performance against LRU and the state-of-the-art HAP policy by up to 20.4% and 11.7% (11.1% and 5.7% on average), respectively.
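
To see why hit rate alone can mislead in a hybrid memory, the sketch below uses the textbook Average Memory Access Time formula with the miss penalty split between DRAM and NVM. The split by the fraction of misses served from NVM, the function name and all numbers are assumptions made for illustration; they are not the paper's exact model or measurements.

```python
def amat(hit_time, miss_rate, nvm_miss_fraction, dram_penalty, nvm_penalty):
    """Average Memory Access Time with an asymmetric miss penalty.

    nvm_miss_fraction is the share of cache misses served from NVM;
    the remaining misses are served from DRAM.
    """
    avg_penalty = ((1.0 - nvm_miss_fraction) * dram_penalty
                   + nvm_miss_fraction * nvm_penalty)
    return hit_time + miss_rate * avg_penalty


# Hypothetical policies (times in cycles): policy B misses slightly more
# often, but fewer of its misses go to the slower NVM.
amat_a = amat(hit_time=10, miss_rate=0.10, nvm_miss_fraction=0.5,
              dram_penalty=100, nvm_penalty=300)
amat_b = amat(hit_time=10, miss_rate=0.11, nvm_miss_fraction=0.2,
              dram_penalty=100, nvm_penalty=300)
```

In this made-up case policy B has the lower hit rate yet the lower AMAT, which is the kind of situation a miss-penalty-aware replacement policy is meant to exploit.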

ENDURANCE MANAGEMENT FOR RESISTIVE LOGIC-IN-MEMORY COMPUTING ARCHITECTURES

Authors:
Saeideh Shirinzadeh, University of Bremen, DE

Abstract
Resistive Random Access Memory (RRAM) is a promising non-volatile memory technology which enables modern in-memory computing architectures. Although RRAMs are known to be superior to conventional memories in many aspects, they suffer from a low write endurance. In this paper, we focus on balancing memory write traffic as a solution to extend the lifetime of resistive crossbar architectures. As a case study, we monitor the write traffic in a Programmable Logic-in-Memory (PLiM) architecture, and propose an endurance management scheme for it. The proposed endurance-aware compilation is capable of handling different trade-offs between write balance, latency, and area of the resulting PLiM implementations. Experimental evaluations on a set of benchmarks including large arithmetic and control functions show that the standard deviation of writes can be reduced by 86.65% on average compared to a naive compiler, while the average number of instructions and RRAM devices also decreases by 36.45% and 13.67%, respectively.
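
The write-balance metric mentioned above, the standard deviation of writes across devices, can be computed as in the following sketch. The per-device write counts are invented for illustration, and the choice of the population standard deviation (`statistics.pstdev`) is an assumption, since the abstract does not say which variant is used.

```python
import statistics


def write_std(write_counts):
    """Population standard deviation of per-device write counts."""
    return statistics.pstdev(write_counts)


def std_reduction_percent(baseline, balanced):
    """Relative reduction of the write standard deviation, in percent."""
    base = write_std(baseline)
    return 100.0 * (base - write_std(balanced)) / base


# Hypothetical write counts for four RRAM devices, 20 writes in total.
naive = [16, 4, 0, 0]      # writes concentrated on one device
balanced = [6, 5, 5, 4]    # same total, spread more evenly
reduction = std_reduction_percent(naive, balanced)
```

A lower standard deviation means that no single device wears out much earlier than the others, which is how balancing write traffic extends the lifetime of the crossbar.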

Download Paper (PDF; Only available from the DATE venue WiFi)
Muhammad Shafique, CARE-Tech, TU Wien, AT

The goal of this special session is to revisit the depth and breadth of CPS security, with a focus on practical system and design automation aspects. The possible sources of security vulnerabilities in a practical system and recent attacks are discussed, and it is argued that a wide variety of attacks needs to be accounted for in a holistic manner.
<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>08:30</td>
<td>9.3.1</td>
<td>SECURE CYBER-PHYSICAL SYSTEMS: CURRENT TRENDS, TOOLS AND OPEN RESEARCH PROBLEMS</td>
<td>Anupam Chattopadhyay, Nanyang Technological University, SG</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speaker:</td>
<td>Authors:</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Anupam Chattopadhyay¹, Alok Prakash¹ and Muhammad Shafique²</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>¹Nanyang Technological University, SG; ²Vienna University of Technology (TU Wien), AT</td>
<td></td>
</tr>
<tr>
<td>08:45</td>
<td>9.3.2</td>
<td>DON'T FALL INTO A TRAP: PHYSICAL SIDE-CHANNEL ANALYSIS OF CHACHA20-POLY1305</td>
<td>Bernhard Jungk, Temasek Laboratories @ Nanyang Technological University, SG</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speaker:</td>
<td>Authors:</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Bernhard Jungk¹ and Shivam Bhasin²</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>¹Temasek Laboratories @ Nanyang Technological University, SG; ²Til@NTU, SG</td>
<td></td>
</tr>
<tr>
<td>09:00</td>
<td>9.3.3</td>
<td>THE ROWHAMMER PROBLEM AND OTHER ISSUES WE MAY FACE AS MEMORY BECOMES DENSER</td>
<td>Onur Mutlu, ETH Zurich, CH</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speaker and Author:</td>
<td>Authors:</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Onur Mutlu, ETH Zurich, CH</td>
<td></td>
</tr>
<tr>
<td>09:15</td>
<td>9.3.4</td>
<td>COMPROMISING FPGA SOCS USING MALICIOUS HARDWARE BLOCKS</td>
<td>Nisha Jacob, Fraunhofer AISEC, DE</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speaker:</td>
<td>Authors:</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Nisha Jacob¹, Carsten Rolfes¹, Andreas Zankl¹, Johann Heyssl¹ and Georg Sigl²</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>¹Fraunhofer Institute for Applied and Integrated Security (AISEC), DE; ²Technische Universität München, DE</td>
<td></td>
</tr>
<tr>
<td>09:30</td>
<td>9.3.5</td>
<td>INSPIRING TRUST IN OUTSOURCED INTEGRATED CIRCUIT FABRICATION</td>
<td>Siddharth Garg, New York University, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speaker and Author:</td>
<td>Authors:</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Siddharth Garg, New York University, US</td>
<td></td>
</tr>
<tr>
<td>09:45</td>
<td>9.3.6</td>
<td>ANALYZING SECURITY BREACHES OF COUNTERMEASURES THROUGHOUT THE REFINEMENT PROCESS IN HARDWARE DESIGN FLOW</td>
<td>Sylvain Guilley, Jean-Luc Danger, Philippe Nguyen, Robert Nguyen and Youssef Souissi, Secure-IC S.A.S., FR</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Speaker:</td>
<td>Authors:</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Jean-Luc Danger, Secure-IC, FR</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Sylvain Guilley, Jean-Luc Danger, Philippe Nguyen, Robert Nguyen and Youssef Souissi, Secure-IC S.A.S., FR</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Side-channel and fault injection attacks are two threats on devices carrying sensitive information. Protections are thus implemented at design time. However, CAD (Computer Aided Design) tools can compromise them, in ways we detail pedagogically in this paper. Then, we explain how a simulation-based methodology allows to check for non-regression, and find problems in case some are introduced while refining the design description from RTL (Register Transfer Level) source code to GDS (Graphic Display System) stream format.</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
<td></td>
</tr>
</tbody>
</table>
This session features methods that extract desired implementation options from the huge design space of digital systems. The first talk presents a method to pick valuable operating points from a Pareto optimal set of task mappings for efficient online resource management. The second talk presents a rapid estimation framework to evaluate performance/area metrics of various accelerator options for an application at an early design phase. The third talk covers a design space exploration for implementing the convolutional layers of neural networks on FPGA accelerators. The session concludes with two short introductions of interactive presentations.

**AUTOMATIC OPERATING POINT DISTILLATION FOR HYBRID MAPPING METHODOLOGIES**

**Authors:**
Behnaz Pourmohseni, Friedrich-Alexander-Universität Erlangen–Nürnberg, DE
Michael Glaß, Ulm University, DE
Jürgen Teich, Friedrich-Alexander-Universität Erlangen–Nürnberg, DE

**Abstract**
Efficient execution of applications on heterogeneous many-core platforms requires mapping solutions that address different aspects of run-time dynamism like resource availability, energy budgets, and timing requirements. Hybrid mapping methodologies employ a static design space exploration (DSE) to obtain a set of mapping alternatives termed operating points that trade off quality properties (compute performance, energy consumption, etc.) and resource requirements (number of allocated resources of each type, etc.) among which one is selected at run-time by a run-time resource manager (RRM). Given multiple quality properties and the presence of heterogeneous resources, the DSE typically delivers a substantially large set of operating points handling of which may impose an intolerable run-time overhead to the RRM. This paper investigates the problem of truncation of operating points termed operating point distillation, such that (a) an acceptable run-time overhead is achieved, (b) on-line quality requirements are met, and (c) dynamic resource constraints are satisfied, i.e., application embeddability is preserved. We propose an automatic design-time distillation methodology that employs a hyper grid-based approach to retain diverse trade-off options wrt. quality properties, while selecting representative operating points based on their resource requirements to achieve a high level of run-time embeddability. Experimental results for a variety of applications show that compared to existing truncation approaches, proposed methodology significantly enhances the run-time embeddability while achieving a competitive and often improved efficiency in the distilled quality properties.

**Download Paper (PDF; Only available from the DATE venue WiFi)**
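
Illustrative note (not taken from the paper): the idea of thinning a Pareto set while keeping diverse quality trade-offs and low resource demand can be sketched as below. This is a deliberately simplified stand-in, assuming two quality properties, a single resource count, and a uniform grid; the names `OperatingPoint`, `pareto_front`, `distill`, and the `cell` size are hypothetical and do not reproduce the authors' methodology.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    latency: int  # quality property, lower is better
    energy: int   # quality property, lower is better
    cores: int    # resource requirement, fewer is easier to embed at run time


def dominates(a, b):
    """a dominates b if it is no worse in every field and differs in at least one."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def distill(points, cell):
    """Keep one representative per quality-grid cell: the one needing fewest cores."""
    kept = {}
    for p in pareto_front(points):
        key = (p.latency // cell, p.energy // cell)  # grid cell in quality space
        best = kept.get(key)
        if best is None or p.cores < best.cores:
            kept[key] = p
    return sorted(kept.values())


points = [
    OperatingPoint(10, 50, 4),
    OperatingPoint(11, 49, 4),
    OperatingPoint(12, 48, 2),
    OperatingPoint(30, 20, 1),
    OperatingPoint(31, 21, 3),  # dominated by (30, 20, 1)
]
distilled = distill(points, cell=10)
print(distilled)
```

With these made-up points, the two near-identical trade-offs around latency 11 to 12 collapse into the one that needs fewer cores, while the distinct low-energy point is retained.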

**DESIGN SPACE EXPLORATION OF FPGA-BASED ACCELERATORS WITH MULTI-LEVEL PARALLELISM**

**Authors:**
Guanwen Zhong, National University of Singapore, SG
Alok Prakash, Nanyang Technological University, SG
Siqi Wang, Yun (Eric) Liang, Peking University, CN
Tulika Mitra, Smail Niar, LAMIH-University of Valenciennes, FR

**Abstract**
Applications containing compute-intensive kernels with nested loops can effectively leverage FPGAs to exploit fine- and coarse-grained parallelism. HLS tools used to translate these kernels from high-level languages (e.g., C/C++), however, are inefficient in exploiting multiple levels of parallelism automatically, thereby producing sub-optimal accelerators. Moreover, the large design space resulting from the various combinations of fine- and coarse-grained parallelism options makes exhaustive design space exploration prohibitively time-consuming with HLS tools. Hence, we propose a rapid estimation framework, MPSeeker, to evaluate performance/area metrics of various accelerator options for an application at an early design phase. Experimental results show that MPSeeker can rapidly (in minutes) explore the complex design space and accurately estimate performance/area of various design points to identify the near-optimal (95.7% performance of the optimal on average) combination of parallelism options.

**Download Paper (PDF; Only available from the DATE venue WiFi)**
### 9.4.3 Design Space Exploration of FPGA Accelerators for Convolutional Neural Networks

**Time:** 09:30

**Speaker:** Jongeun Lee, UNIST, KR

**Authors:**
- Atul Rahman1, Sangyun Oh2, Jongeun Lee3, and Kiyoung Choi4
- 1Samsung Electronics, KR; 2UNIST, KR; 3Ulsan National Institute of Science and Technology (UNIST), KR; 4Seoul National University, KR

Download Paper (PDF; Only available from the DATE venue WiFi)

### 9.4.4 A Slack-Based Approach to Efficiently Deploy Radix 8 Booth Multipliers

**Time:** 09:45

**Speaker:** Alberto Antonio Del Barrio García, Complutense University of Madrid, ES

**Authors:**
- Alberto Antonio Del Barrio García and Hermida Roman
- Universidad Complutense de Madrid, ES

**Abstract**

In 1951 A. Booth published his algorithm to efficiently multiply signed numbers. Since the appearance of such algorithm, it has been widely accepted that radix 4-based Booth multipliers are the most efficient. They allow the height of the multiplier to be halved, at the expense of a simple recoding that consists of just shifts and negations. Theoretically, higher radix should produce even larger reductions, especially in terms of area and power, but the recoding process is much more complex. Notably, in the case of radix 8 it is necessary to compute 3X as an extra operation within the application’s Dataflow Graph (DFG). Experiments show that typically there is enough slack in the DFGs to do this without degrading the performance of the circuit, which permits the efficient deployment of radix 8 multipliers that do not calculate the 3X multiple. Results show that our approach is 10% and 17% faster than radix 4 and radix 8 Booth based implementations, respectively, and 12% and 10% more energy efficient in terms of Energy Delay Product.

Download Paper (PDF; Only available from the DATE venue WiFi)
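
Illustrative note (not taken from the paper): the reason radix 8 needs the 3X multiple can be seen from the standard Booth recoding itself. The sketch below recodes a two's-complement multiplier into radix-4 or radix-8 Booth digits and reports which multiples of X the partial products require. The function names and the 8-bit width are assumptions chosen for the example; the slack-based scheduling of the 3X computation described in the abstract is not modeled.

```python
def booth_digits(y, bits, radix_log2):
    """Recode a signed integer into Booth digits, least significant first.

    radix_log2=2 gives radix 4 (digits in -2..2);
    radix_log2=3 gives radix 8 (digits in -4..4).
    """
    assert -(1 << (bits - 1)) <= y < (1 << (bits - 1)), "y must fit in `bits`"
    k = radix_log2
    n_digits = -(-bits // k)  # ceil(bits / k)

    def bit(i):
        # Bit i of y; Python's >> on negative ints is arithmetic (sign-extending).
        return 0 if i < 0 else (y >> i) & 1

    digits = []
    for i in range(n_digits):
        lo = k * i
        d = bit(lo - 1)                          # overlap bit from the group below
        for j in range(k - 1):
            d += bit(lo + j) << j                # positively weighted bits
        d -= bit(lo + k - 1) << (k - 1)          # top bit of the group is negative
        digits.append(d)
    return digits


def booth_multiply(x, y, bits, radix_log2):
    """Multiply via Booth digits; also return the |multiples| of x that were used."""
    multiples = set()
    product = 0
    for i, d in enumerate(booth_digits(y, bits, radix_log2)):
        multiples.add(abs(d))
        product += (d * x) << (radix_log2 * i)
    return product, multiples


print(booth_digits(3, 8, 2))  # radix 4: [-1, 1, 0, 0]
print(booth_digits(3, 8, 3))  # radix 8: [3, 0, 0]
```

Radix 4 only ever needs 0, X, and 2X (shifts and negations), and an 8-bit multiplier yields four partial products. Radix 8 yields three, but the digit 3 can appear, so 3X = X + 2X needs a real addition.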

### 9.4.5 A Schedulability Test for Software Migration on Multicore Systems

**Time:** 10:00

**Speaker:** Jung-Eun Kim, Department of Computer Science at the University of Illinois at Urbana-Champaign, US

**Authors:**
- Jung-Eun Kim, Richard Bradford, Tarek Abdelzaher, and Lui Sha
- 1Department of Computer Science, University of Illinois at Urbana-Champaign, US; 2Rockwell Collins, Cedar Rapids, IA, US; 3University of Illinois, US

**Abstract**

This paper presents a new schedulability test for safety-critical software undergoing a transition from single-core to multicore systems - a challenge faced by multiple industries today. Our migration model consists of a schedulability test and execution model. Its properties enable us to obtain a utilization bound that places an allowable limit on total task execution times. Evaluation results demonstrate the advantages of our scheduling model over competing resource partitioning approaches, such as Periodic Server and TDMA.

Download Paper (PDF; Only available from the DATE venue WiFi)

**10:00** End of session

**Coffee Break in Exhibition Area**

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

- Tuesday, March 28, 2017
  - Coffee Break 10:30 - 11:30
  - Coffee Break 16:00 - 17:00

- Wednesday, March 29, 2017
  - Coffee Break 10:00 - 11:00
  - Coffee Break 16:00 - 17:00

- Thursday, March 30, 2017
  - Coffee Break 10:00 - 11:00
  - Coffee Break 15:30 - 16:00

### 9.5 Modeling and optimization of Internet-of-things (IoT) devices

**Date:** Thursday 30 March 2017

**Time:** 08:30 - 10:00

**Location / Room:** 3C

**Chair:**
- William Fornaciari, Politecnico di Milano, IT

**Co-Chair:**
- Shusuke Yoshimoto, Osaka University, JP

This session covers the modeling and optimization of Internet-of-things (IoT) devices, from energy sources to computing components, including the battery, energy harvesting system, power converter, and microprocessor.

MEASUREMENT AND VALIDATION OF ENERGY HARVESTING IOT DEVICES
Speaker: Lukas Sigrist, ETH Zurich, CH
Authors: Lukas Sigrist1, Andres Gomez2, Roman Lim1, Stefan Lippuner1, Matthias Leubin1 and Lothar Thiele2
1ETH Zurich, CH; 2Swiss Federal Institute of Technology Zurich, CH
Abstract
With the appearance of wearable devices and the IoT, energy harvesting nodes are becoming more and more important. The design and evaluation of these small standalone sensors and actuators, which harvest limited amounts of energy, requires novel tools and methods. Fast and accurate measurement systems are required to capture the rapidly changing harvesting scenarios and characterize leakage currents and energy efficiencies. The need for real-world experiments creates a demand for compact and portable equipment to perform in-situ power measurements and environmental logging. This work presents the RocketLogger, a hand-held measurement device that combines both: portability and a high-quality datalogger. The custom analog front-end allows logging at sampling rates up to 64 kSPS. The fast range switching within 1.4 µs guarantees continuous power measurements starting from 4 pW at 1 mV up to 2.75 W at 5.5 V. The software provides remote control and manages data acquisition of up to 13 Mb/s in real-time. We extensively characterize the RocketLogger’s performance, demonstrate the need for its properties in three use-cases at different stages of the system design flow, and show its advantages in measuring and validating new harvesting-driven devices for the IoT.
Download Paper (PDF; Only available from the DATE venue WiFi)

A METHODOLOGY FOR THE DESIGN OF DYNAMIC ACCURACY OPERATORS BY RUNTIME BACK BIAS
Speaker: Daniele Jahier Pagliari, Politecnico di Torino, IT
Authors: Daniele Jahier Pagliari1, Yves Durand2, David Coriat2, Anca Molnos2, Edith Beigne2, Enrico Macii1 and Massimo Poncino1
1Politecnico di Torino, IT; 2CEA-Leti, FR
Abstract
Mobile and IoT applications must balance increasing processing demands with limited power and cost budgets. Approximate computing achieves this goal leveraging the error tolerance features common in many emerging applications to reduce power consumption. In particular, adequate (i.e., energy/quality-configurable) hardware operators are key components in an error tolerant system. Existing implementations of these operators require significant architectural modifications, hence they are often design-specific and tend to have large overheads compared to accurate units. In this paper, we propose a methodology to design adequate datapath operators in an automatic way, which uses threshold voltage scaling as a knob to dynamically control the power/accuracy tradeoff. The method overcomes the limitations of previous solutions based on supply voltage scaling, in that it introduces lower overheads and it allows fine-grain regulation of this tradeoff. We demonstrate our approach on a state-of-the-art 28nm FDSOI technology, exploiting the strong effect of back biasing on threshold voltage. Results show a power consumption reduction of as much as 39% compared to solutions based only on supply voltage scaling, at iso-accuracy.
Download Paper (PDF; Only available from the DATE venue WiFi)

A SCAN-CHAIN BASED STATE RETENTION METHODOLOGY FOR IOT PROCESSORS OPERATING ON INTERMITTENT ENERGY
Speaker: Pascal Alexander Hager, ETH Zurich, CH
Authors: Pascal Alexander Hager1, Hamed Fatemi2, Jose Pineda2 and Luca Benini1
1ETH Zurich, CH; 2NXP Semiconductors, NL; 3Università di Bologna, IT
Abstract
Future IoT systems are tightly constrained by cost and size and will often be operated from an energy harvester’s output. Since these batteryless systems operate on intermittent energy, they have to be able to retain their state during power outages in order to guarantee computation progress. Due to the lack of large energy buffers, the state needs to be saved quickly using residual energy only. In related work, the state is retained in-place by replacing all flip-flops with state-retentive flip-flops (SRFFs), which are powered by auxiliary supplies for retention or incorporate non-volatile memory cells. However, these SRFFs increase the power consumption during active operation, impairing the overall system’s efficiency. In this paper, we present a scan-chain based state retention approach, where the state is moved to memory using only 4.5 pJ/b. Since our approach does not introduce any power overhead, this energy cost pays off after an on-time of just 100 µs compared to state-of-the-art in-place solutions. Moreover, compared to a software mechanism, our approach requires 6.6x less energy to move the state and is 5.8x faster.
Download Paper (PDF; Only available from the DATE venue WiFi)

A CIRCUIT-EQUIVALENT BATTERY MODEL ACCOUNTING FOR THE DEPENDENCY ON LOAD FREQUENCY
Speaker: Yukai Chen, Politecnico di Torino, IT
Authors: Yukai Chen, Enrico Macii and Massimo Poncino, Politecnico di Torino, IT
Abstract
Circuit-equivalent battery models are considered de-facto standard for modeling and simulation of digital systems due to many practical advantages. In spite of the many variants of models proposed in the literature, none of them accounts for one important feature of the battery dynamics, namely, the dependency on the frequency of current load profile. For a given average current value, current loads with different spectral distributions may have quite different impacts on the battery discharge. This is a very well-known issue in the design of hybrid energy storage systems, where different types of storage devices are used, each with different storage efficiency for different load frequency ranges. We propose a basic modification to a state-of-the-art model that incorporates this load frequency dependency, as well as a methodology to identify the frequency-sensitive parameters of the model from publicly available data (e.g., datasheets). Results show that frequency-agnostic models can significantly overestimate the battery state-of-charge, and that this effect is far from being negligible.
Download Paper (PDF; Only available from the DATE venue WiFi)

ADAPTIVE POWER DELIVERY SYSTEM MANAGEMENT FOR MANY-CORE PROCESSORS WITH ON/OFF-CHIP VOLTAGE REGULATORS
Speaker: Haoran Li, The Hong Kong University of Science and Technology, HK
Authors: Haoran Li, Jiang Xu, Zhe Wang, Peng Yang, Rafael Kioji Vivas Maeda and Zhongyuan Tian, The Hong Kong University of Science and Technology, HK
Abstract
The power delivery system (PDS) plays a crucial role in guaranteeing the proper functionality of many-core processors. However, as the PDS is usually optimized to provide power to the target chip at its best performance level, its energy efficiency can be seriously degraded under highly dynamic workloads, making it a major source of system power losses. On-chip voltage regulators (VRs), which are able to achieve fast and fine-grained power control, have been proposed as an alternative. However, the high power consumption of VRs has been a major challenge to implementing fine-grained power adjustments. In this paper, we propose the adaptive Quantized Power Management (QPM) scheme to dynamically adjust the PDS with both on-chip and off-chip VRs based on run-time workloads. Experimental results show that QPM applied on a hybrid PDS with both on-chip and off-chip VRs achieves 74.1% average overall energy efficiency, 12.3% higher than the conventional PDS with a single off-chip VR.

## 9.6 Reliability and Optimization Techniques for Analog Circuits

**Date:** Thursday 30 March 2017  
**Location / Room:** 5A

**Chair:**  
Manuel Barragan, TIMA, FR

**Co-Chair:**  
Said Hamdioui, TU Delft, NL

The first two papers discuss optimizations for the yield and performance of analog circuits. The third paper proposes methods for flip-flop soft error protection in sequential circuits, while the last paper discusses machine-learning-based methods for timing error detection.

### 9.6.1 SLOT: A SUPERVISED LEARNING MODEL TO PREDICT DYNAMIC TIMING ERRORS OF FUNCTIONAL UNITS

**Speaker:** Xun Jiao, University of California San Diego, US  
**Authors:** Xun Jiao, Yu Jiang, Abbas Rahimi and Rajesh Gupta  
1University of California, San Diego, US; 2Tsinghua University, CN; 3University of California, Berkeley, US

**Abstract**  
Dynamic timing errors (DTEs), which are caused by timing violations of sensitized critical timing paths, have emerged as an important threat to the reliability of digital circuits. Existing approaches model DTEs without considering the impact of input operands on dynamic path sensitization, resulting in a loss of accuracy. The diversity of input operands leads to complex path sensitization behaviors that are hard to represent in DTE modeling. In this paper, we propose SLOT, a supervised learning model to predict the output of functional units (FUs) to be one of two timing classes (timing correct or timing erroneous), as a function of input operands and clock period. We apply the random forest classification (RFC) method to construct SLOT, using input operands, computation history and circuit toggling as input features and the outputs’ timing classes as labels. The outputs’ timing classes are measured using gate-level simulation (GLS) of a post place-and-route design in a TSMC 45nm process. For evaluation, we apply SLOT to several FUs, and on average 95% of predictions are consistent with GLS, which is 6.3X higher compared to the existing instruction-level model. SLOT-based reliability analysis of FUs under different benchmark datasets can achieve 0.7-4.8% average difference compared with GLS-based analysis, and executes more than 20X faster than GLS.
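
As a rough illustration of the three feature groups named in the abstract (input operands, computation history and toggling), the sketch below builds one feature row per cycle. The bit-level encoding, the function names `to_bits` and `build_features`, and the example operands are assumptions made for illustration; the abstract does not specify how SLOT actually encodes its features.

```python
def to_bits(value, width):
    """Little-endian list of the low `width` bits of value."""
    return [(value >> i) & 1 for i in range(width)]

def build_features(prev_operands, curr_operands, width):
    """Hypothetical feature row: current operand bits, previous operand
    bits (history), and per-bit toggles between the two cycles."""
    features = []
    for curr in curr_operands:
        features += to_bits(curr, width)
    for prev in prev_operands:
        features += to_bits(prev, width)
    for prev, curr in zip(prev_operands, curr_operands):
        features += to_bits(prev ^ curr, width)  # 1 where a bit toggled
    return features

# Invented example: a 4-bit, two-operand functional unit
row = build_features(prev_operands=(0b0011, 0b0101),
                     curr_operands=(0b0110, 0b0101),
                     width=4)
print(len(row), row)
```

Rows like this, paired with a timing-correct or timing-erroneous label per clock period, could then be handed to any random forest classifier implementation.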

Download Paper (PDF; Only available from the DATE venue WiFi)

### 9.6.2 EXPLOITING DATA-DEPENDENCE AND FLIP-FLOP ASYMMETRY FOR ZERO-OVERHEAD SYSTEM SOFT ERROR MITIGATION

**Speaker:** Liangzhen Lai, ARM Inc., US  
**Authors:** Liangzhen Lai and Vikas Chandra, ARM Inc., US

**Abstract**  
Soft errors are one of the major threats to resilient computing. Unlike SRAM soft errors, which can be effectively protected against by ECC, flip-flop soft error protection can be costly. We observe that flip-flops/latches can have asymmetric soft error rates when storing different logic values. This asymmetry can be used in conjunction with the different signal probabilities of registers in a design. In this work, we first demonstrate that flip-flop cells can be designed to have different soft error rates when storing different logic values. We also propose a methodology to match registers in a design with the flip-flop cells that minimize the soft error rates. Experimental results on a commercial processor show that, with only flip-flop layout changes, our proposed scheme can reduce system SER by 16% with no overhead in performance, power and area. The system SER reduction can be improved to 48% with schematic changes and a 6.7% average increase in flip-flop area.
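
The matching idea can be sketched as a simple expected-value calculation: weight each cell's two soft error rates by the probability that a register holds 0 or 1, then pick the cell with the lowest result. This is only a minimal sketch; the cell names, SER values and signal probabilities below are invented, and the paper's actual methodology may differ.

```python
def expected_ser(p_one, ser_when_zero, ser_when_one):
    """Expected soft error rate of a cell that holds 1 with probability p_one."""
    return p_one * ser_when_one + (1.0 - p_one) * ser_when_zero

def match_cells(register_p_one, cells):
    """For each register, choose the cell variant with the lowest expected SER."""
    return {
        reg: min(cells, key=lambda name: expected_ser(p, *cells[name]))
        for reg, p in register_p_one.items()
    }

# Hypothetical cell variants: name -> (SER storing 0, SER storing 1), arbitrary units
cells = {
    "symmetric": (1.0, 1.0),
    "robust_at_0": (0.6, 1.5),
    "robust_at_1": (1.5, 0.6),
}
# Hypothetical signal probabilities (probability that the register holds 1)
register_p_one = {"status_flag": 0.05, "toggle_bit": 0.50, "valid_bit": 0.95}

assignment = match_cells(register_p_one, cells)
print(assignment)
```

With these made-up numbers, registers that are strongly biased towards one value get the cell hardened for that value, while the unbiased register keeps the symmetric cell.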

Download Paper (PDF; Only available from the DATE venue WiFi)

EFFICIENT YIELD OPTIMIZATION METHOD USING A VARIABLE K-MEANS ALGORITHM FOR ANALOG IC SIZING

Speaker: António Canelas, Instituto de Telecomunicações/Instituto Superior Técnico – Lisboa, PT
Authors: António Canelas, Ricardo Martins, Ricardo Povo, Nuno Lourenço, and Nuno Horta

Abstract: This paper presents a novel approach for improving automated analog yield optimization using a two-step exploration strategy. First, a global optimization phase relies on a modified Lipschitzian optimization to sample the potentially optimal sub-regions of the feasible design space. The search locates a design point near the optimal solution that is used as a starting point by a local optimization phase. The local search constructs linear interpolating surrogate models of the yield to explore the basin of convergence and to rapidly reach the global optimum. Experimental results show that our approach locates higher quality design points in terms of yield rate in less run time and without affecting the accuracy.

Download Paper (PDF; Only available from the DATE venue WiFi)

A NEW SAMPLING TECHNIQUE FOR MONTE CARLO-BASED STATISTICAL CIRCUIT ANALYSIS

Speaker: Hiwa Mahmoudi, Vienna University of Technology, AT
Authors: Hiwa Mahmoudi and Horst Zimmermann, Vienna University of Technology, AT

Abstract: Variability is a fundamental issue which gets exponentially worse as CMOS technology shrinks. Therefore, characterization of statistical variations has become an important part of the design phase. Monte Carlo-based simulation is a standard technique for statistical analysis and modeling of integrated circuits. However, crude Monte Carlo sampling based on pseudorandom selection of parameter variations suffers from low convergence rates, and thus providing high accuracy is computationally expensive. In this work, we present an extensive study on the performance of two widely used techniques, Latin Hypercube and Low Discrepancy sampling, and compare their speed-up and accuracy. It is shown that these methods can be more efficient than pseudorandom sampling, but only in limited applications. Therefore, we propose a new sampling scheme that exploits the benefits of both methods by combining them. Through representative circuit examples, it is shown that the proposed sampling technique provides a major improvement in terms of computational effort and offers better properties than either method alone.
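
To make the contrast between crude pseudorandom sampling and Latin Hypercube sampling concrete, here is a minimal sketch using only the Python standard library. It shows the textbook Latin Hypercube construction, not the combined scheme proposed in the paper, and the function names and the sample size of 16 are arbitrary choices for illustration.

```python
import random

def pseudorandom_samples(n, dims, rng):
    """Crude Monte Carlo: n independent uniform points in [0, 1)^dims."""
    return [[rng.random() for _ in range(dims)] for _ in range(n)]

def latin_hypercube_samples(n, dims, rng):
    """Latin Hypercube: each dimension is split into n equal strata and
    every stratum receives exactly one sample."""
    columns = []
    for _ in range(dims):
        # One point inside each stratum, then shuffle the stratum order
        column = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(column)
        columns.append(column)
    # Transpose the per-dimension columns into n points
    return [list(point) for point in zip(*columns)]

def strata_covered(samples, dim, n):
    """Number of distinct strata of width 1/n hit along one dimension."""
    return len({min(int(p[dim] * n), n - 1) for p in samples})

rng = random.Random(0)
n, dims = 16, 2
lhs = latin_hypercube_samples(n, dims, rng)
mc = pseudorandom_samples(n, dims, rng)
print("strata covered (LHS):", [strata_covered(lhs, d, n) for d in range(dims)])
print("strata covered (MC): ", [strata_covered(mc, d, n) for d in range(dims)])
```

By construction the Latin Hypercube points cover every stratum in every dimension, whereas independent pseudorandom points typically leave some strata empty and hit others more than once.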

Download Paper (PDF; Only available from the DATE venue WiFi)

AUTOMATIC TECHNOLOGY MIGRATION OF ANALOG IC DESIGNS USING GENERIC CELL LIBRARIES

Speaker: Jose Cachaco, Instituto de Telecomunicacoes/Instituto Superior Tecnico, PT; Instituto de Telecomunicacoes/Instituto Politecnico de Tomar, PT; Instituto de Telecomunicacoes/Instituto Superior Tecnico, PT
Authors: Jose Cachaco, Nuno Machado, Nuno Lourenço, Jorge Guilherme, and Nuno Horta

Abstract: This paper addresses the problem of automatic technology migration of analog IC designs. The proposed approach introduces a new level of abstraction for EDA tools addressing analog IC design, allowing a systematic and effortless adaptation of a design to a new technology. The new abstraction level is based on generic cell libraries, which include topology and testbench descriptions for specific circuit classes. The new approach is implemented and tested using a state-of-the-art multi-objective multi-constraint circuit-level optimization tool, and is validated for the sizing and optimization of continuous-time comparators, including technology migration between two different design nodes, namely XFAB 350 nm technology (XHDS3) and ATMEL 150 nm SOI technology (AT7X).

Download Paper (PDF; Only available from the DATE venue WiFi)

NOISE-SENSITIVE FEEDBACK LOOP IDENTIFICATION IN LINEAR TIME-VARYING ANALOG CIRCUITS

Speaker: Peng Li, Texas A&M University, US
Authors: Ang Li, Peng Li 1, Tingwen Huang 2 and Edgar Sánchez-Sinencio 1
1Texas A&M University, US; 2Texas A&M University at Qatar, QA

Abstract
The continuing scaling of VLSI technology and design complexity has rendered robustness of analog circuits a significant concern. Parasitic effects may introduce unexpected marginal instability within multiple noise-sensitive loops and hence jeopardize circuit operation and processing precision. The Loop Finder algorithm has been recently proposed to allow detection of noise-sensitive return loops for circuits that are described using a linear time-invariant (LTI) system model. However, many practical circuits such as switched-capacitor filters and mixers present time-varying behaviors which are intrinsically coupled with noise propagation and introduce new noise generation mechanisms. For the first time, we take an in-depth look into the marginal instability of linear periodically time-varying (LPTV) analog circuits and further develop an algorithm for efficient identification of noise-sensitive loops, unifying the solution to noise sensitivity analysis for both LTI and LPTV circuits.

Download Paper (PDF; Only available from the DATE venue WiFi)

A THERMALLY-AWARE ENERGY MINIMIZATION METHODOLOGY FOR GLOBAL INTERCONNECTS

Speaker:
Alfazli Kusha, Tehran University, IR

Authors:
Soheil Nazar Shahsavani 1, Alireza Shafaei Bejestan 1, Shahin Nazarian 1 and Massoud Pedram 2
1University of Southern California, US; 2USC, US

Abstract
As a result of the Temperature Effect Inversion (TEI) in FinFET-based designs, gate delays decrease with the increase of temperature. In contrast, the resistive characteristic and hence delay of global interconnects increase with the temperature. However, as shown in this paper, if buffers are judiciously inserted in global interconnects, the buffer delay decrease is more pronounced than the interconnect delay increase, resulting in an overall performance improvement at higher temperatures. More specifically, this work models the delay of buffer-inserted global interconnects vs. temperature in order to derive the optimal number and size of buffers for a given interconnect length and temperature. Furthermore, the paper addresses the problem of minimizing the buffered interconnect energy consumption by changing the supply voltage level or FinFET threshold voltage, and also presents a temperature-aware optimization policy for solving this problem. Simulation results show average interconnect energy savings of 16% with no performance penalty for five different benchmarks implemented on a 14nm FinFET technology.

Download Paper (PDF; Only available from the DATE venue WiFi)

CANDY-TM: COMPARATIVE ANALYSIS OF DYNAMIC THERMAL MANAGEMENT IN MANY-CORES USING MODEL CHECKING

Speaker:
Muhammad Shafique, Institute of Computer Engineering, Vienna University of Technology (TU Wien), AT

Authors:
Sayed Ali Asadullah Bukhari 1, Faiz Khalid Lodhi 2, Osman Hasan 2, Muhammad Shafique 3 and Joerg Henkel 4
1National University of Sciences and Technology - School of Electrical Engineering and Computer Science, PK; 2School of Electrical Engineering and Computer Science National University of Sciences and Technology (NUST), PK; 3Vienna University of Technology (TU Wien), AT; 4Karlsruhe Institute of Technology, DE

Abstract
Dynamic thermal management (DTM) techniques based on task migration provide a promising solution to mitigate thermal emergencies and thereby ensure the safe operation and reliability of many-core systems. These techniques can be classified as central or distributed, depending on whether there is a central DTM controller for the whole system or individual DTM controllers for each core or set of cores in the system, respectively. However, obtaining a trustworthy comparison between central (c-) and distributed (d-) DTM techniques to find the most suitable one for a given system is quite challenging. This is primarily due to the systemic difference between cDTM and dDTM controllers, and the inherent non-exhaustiveness of the simulation and emulation methods conventionally used for DTM analysis. In this paper, we present a novel methodology called CANDY-TM (which stands for Comparative Analysis of Dynamic Thermal Management) that employs model checking to perform formal comparative analysis of cDTM and dDTM techniques. We identify a set of generic functional and performance properties to provide a common ground for their comparison. We demonstrate the usability and benefits of our methodology by comparing state-of-the-art cDTM and dDTM techniques, and illustrate which technique is good w.r.t. thermal stability and other task migration parameters. Such an analysis helps in selecting the most appropriate DTM for a given chip.

Download Paper (PDF; Only available from the DATE venue WiFi)

POWER PRE-CHARACTERIZED MESHING ALGORITHM FOR FINITE ELEMENT THERMAL ANALYSIS OF INTEGRATED CIRCUITS

Speaker:
Shohdy Abdelkader, Software Developer, EG

Authors:
Shohdy Abdelkader 1, Ala ElRouby 2 and Mohamed Dessouky 1
1Mentor, EG; 2Electric and Electronic Department, Faculty of Engineering and Natural Science, Yildirim Beyazit University, TR

Abstract
In this paper, we present an adaptive meshing technique suitable for steady-state finite element (FE) based thermal analysis of integrated circuits (ICs). The presented algorithm is non-iterative: the technology used is first pre-characterized, and the characterization results are then used to scan the layout, detect high-power regions and mesh them finely. Finally, the analysis is done only once, which makes it faster than conventional iterative adaptive meshing methods. The results show comparable accuracy and better performance when compared to the flux-based (iterative) and the power-aware (non-iterative) algorithms.

Download Paper (PDF; Only available from the DATE venue WiFi)

Coffee Break in Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Tuesday, March 28, 2017
• Coffee Break 10:30 - 11:30
• Coffee Break 16:00 - 17:00

Wednesday, March 29, 2017
• Coffee Break 10:00 - 11:00
• Coffee Break 16:00 - 17:00

Thursday, March 30, 2017
• Coffee Break 10:00 - 11:00
• Coffee Break 15:30 - 16:00

Today everything from door locks to heating systems and vehicles can be connected to the internet, opening up endless possibilities for future innovative technologies. As more low-power, internet-connected gadgets and sensors are integrated into our lives, the growing demand for secure and trustworthy IoT-based systems is becoming the key element in making winning products.

Although security has steadily improved, proper authentication and encrypted communications are still not common, making the overall Internet a network of insecure things. This session proposes a journey through several talks to show the advances in technologies that master the security aspects of IoT.

The session starts with an in-depth overview of security challenges and trends in the IoT ecosystem against cyber-threats. It then introduces the STM32 and secure IoT platforms based on it.

The Internet of Things (IoT) is changing our lives, bringing huge benefits and making a positive impact on society and the economy. It requires trusted systems with efficient security and privacy mechanisms from devices to the Cloud. For years digital security technologies have proven their efficiency in telecom, banking and ID applications. Technical solutions exist and can be reused as tools to provide security and privacy for the IoT.

In this session, we will describe how STMicroelectronics' scalable security offers based on STM32 microcontrollers and STSAFE secure microcontrollers make it possible to build secure IoT solutions with the right level of robustness. The STMicroelectronics scalable offer for IoT security can be easily adapted to efficiently combat various threats. STMicroelectronics, a global semiconductor leader supplying the market with the most advanced technologies and solutions and a 20-year presence in security, is committed to contributing to a more secure connected world.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>08:30</td>
<td>9.8.1</td>
<td>CHALLENGES FOR SECURE IOT</td>
<td>Paolo Pirinetti, Politecnico di Torino, IT</td>
</tr>
<tr>
<td>08:45</td>
<td>9.8.2</td>
<td>MITIGATING THE RISKS IN IOT WITH AN EFFECTIVE SECURITY OFFER</td>
<td>Michele Scarlatella, STMicroelectronics, FR</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract:</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>The IoT will change our lives, bringing huge benefits and making a positive impact on society and the economy, but it requires trusted systems with efficient security and privacy mechanisms from devices to the Cloud. For years digital security technologies have proven their efficiency in telecom, banking and ID applications. Technical solutions exist and can be reused as tools to provide security and privacy for the IoT. In this session, we will describe how STMicroelectronics' scalable security offers based on STM32 microcontrollers and STSAFE secure microcontrollers make it possible to build secure IoT solutions with the right level of robustness. The STMicroelectronics scalable offer for IoT security can be easily adapted to efficiently combat various threats. STMicroelectronics, a global semiconductor leader supplying the market with the most advanced technologies and solutions and a 20-year presence in security, is committed to contributing to a more secure connected world.</td>
<td></td>
</tr>
<tr>
<td>09:00</td>
<td>9.8.3</td>
<td>UNIVERSITY EXPERIENCES USING A SECURE IOT PLATFORM BASED ON STM32</td>
<td>George Kornaros, Univ. of Applied Sciences of Crete, GR</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract:</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>In this session, practical design methods and experiences centered on STM32 devices are presented. Gateways and networks of connected IoT devices need to be secured, as do the devices themselves. Suitable safeguards must be integrated to prevent network interfaces and embedded firmware updates from becoming security holes themselves; these safeguards include securing the data stored by the device, securing communication and protecting the device from cyber-attacks. Software and hardware development approaches are outlined, along with practical experiences that meet the appropriate security level of modern IoT platforms.</td>
<td></td>
</tr>
<tr>
<td>09:15</td>
<td>9.8.4</td>
<td>SECURE COMMUNICATION IN AUTOMOTIVE</td>
<td>Antonio Varriale, Blu5 Labs Ltd, MT</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract:</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>The SEcube™ (Secure Environment cube) platform presented in this session is an open source security-oriented hardware and software platform constructed with ease of integration and service-orientation in mind. It is based on a single-chip design embedding three main cores: a high-power processor, a Common Criteria certified smartcard, and a flexible FPGA. The software components include several libraries of ready-to-use components that provide developers with different entry levels to adoption. This way, security experts can avail of the open source character and verify, change or write from scratch the entire system, starting from the elementary low-level blocks. At the same time developers who use the predefined primitives can experience the SEcube™ as a high-security black box suitable for security-oriented services in several fields, like IoT, Automotive, etc.</td>
<td></td>
</tr>
<tr>
<td>09:30</td>
<td>9.8.5</td>
<td>SECURE COMMUNICATION IN AUTOMOTIVE</td>
<td>Giovanni Gherardi, Energica Motor Company, IT</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract:</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>The growth and diffusion of high-technology consumer communication devices, and the resulting technical skills of the average user, are pushing industry to put connectivity and network functions into devices. The automotive industry is riding this wave as well. Vehicles are nowadays implementing new “Cyber Physical Features” by collecting information from the physical system and processing it via interconnected cyber systems, thus creating new challenges for safety and security. In addition, an increasing number of vehicles are now connected to the Web, and the pervasiveness of interconnected IoT devices is shaping future customer expectations in terms of innovative services. Historically, security was achieved first of all by isolating subsystems; nowadays, the growing number of systems that are indirectly interconnected with IoT services highlights that component-level countermeasures are important but not enough to enforce protection in a modern vehicle. A multi-level, coordinated, system-wide approach is necessary, including (but not limited to) isolation of safety-critical systems, secure gateways, virtualization, and trusted software injection and execution. It also requires a redesign of the vehicle data transport infrastructure, with new communication standards and the adoption of secure protocols like sCAN.</td>
<td></td>
</tr>
</tbody>
</table>

### IP4 Interactive Presentations

**Date:** Thursday 30 March 2017  
**Time:** 10:00 - 10:30  
**Location / Room:** IP sessions (in front of rooms 4A and 5A)

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated to the IP paper is on display throughout the morning. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon Interactive Presentations session the award ‘Best IP of the Day’ is given.

#### IP4-1  
**1024-CHANNEL 3D ULTRASOUND DIGITAL BEAMFORMER IN A SINGLE 5W FPGA**

**Presenter:** Aya Ibrahim, EPFL, CH  
**Authors:** Federico Angiolini, Aya Ibrahim, William Simon, Ahmet Caner Yüzügüler, Marcel Arditi, Jean-Philippe Thiran and Giovanni De Micheli  
**Abstract**  
3D ultrasound, an emerging medical imaging technique that is presently only used in hospitals, has the potential to enable breakthrough telemedicine applications, provided that its cost and power dissipation can be minimized. In this paper, we present an FPGA architecture suitable for a portable medical 3D ultrasound device. We show an optimized design for the digital part of the imager, including the delay calculation block, which is its most critical part. Our computationally efficient approach requires a single FPGA for 3D imaging, which is unprecedented. The design is scalable; a configuration supporting a 32×32-channel probe, which enables high-quality imaging, consumes only about 5W.

**Download Paper (PDF; Only available from the DATE venue WiFi)**

#### IP4-2  
**LAANT: A LIBRARY TO AUTOMATICALLY OPTIMIZE EDP FOR OPENMP APPLICATIONS**

**Presenter:** Arthur Francisco Lorenzon, Federal University of Rio Grande do Sul, BR  
**Authors:** Arthur Lorenzon, Jeckson Dellagostin Souza and Antonio Carlos Schneider Beck Filho, Universidade Federal do Rio Grande do Sul, BR  
**Abstract**  
Efficiently exploiting thread-level parallelism on new multicore systems has been challenging for software developers. Moreover, blindly increasing the number of threads may result in a disproportionate increase in energy consumption. For this reason, choosing the right number of threads is essential to reach the best compromise between performance and energy. However, such a task is extremely difficult: besides the huge number of variables involved, many of them change according to different aspects of the system at hand and can only be determined at run-time. To address this complex scenario, we propose LAANT, a novel library to automatically find the optimal number of threads for OpenMP applications by dynamically considering their particular characteristics, input set, and the processor architecture. By executing nine well-known benchmarks on three real multicore processors, LAANT improves the EDP (Energy-Delay Product) by up to 61% compared to the standard OpenMP execution, and by 44% when the dynamic adjustment of the number of threads of OpenMP is activated.
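
The metric being optimized can be illustrated with a short sketch: EDP is energy multiplied by execution time, and the preferred thread count is the one with the lowest product. The measurements below are invented, and the exhaustive comparison is only a stand-in; the abstract does not describe how LAANT searches for the optimum at run-time.

```python
def energy_delay_product(energy_joules, delay_seconds):
    """EDP = energy x delay; lower is better."""
    return energy_joules * delay_seconds

def best_thread_count(measurements):
    """Pick the thread count with the lowest EDP.

    measurements maps thread count -> (energy in J, execution time in s).
    """
    return min(measurements, key=lambda t: energy_delay_product(*measurements[t]))

# Hypothetical measurements, invented for illustration only
measurements = {
    1: (40.0, 10.0),
    2: (44.0, 5.5),
    4: (52.0, 3.2),
    8: (80.0, 2.9),
}
for threads, (energy, delay) in sorted(measurements.items()):
    print(threads, "threads -> EDP", energy_delay_product(energy, delay))
print("lowest EDP at", best_thread_count(measurements), "threads")
```

In this made-up data set the eight-thread run is the fastest, yet four threads give the lowest EDP because the extra energy outweighs the small gain in execution time.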

**Download Paper (PDF; Only available from the DATE venue WiFi)**

#### IP4-3  
**IMPROVING THE ACCURACY OF THE LEAKAGE POWER ESTIMATION OF EMBEDDED CPUs**

**Presenter:** Shiao-Li Tsao, National Chiao Tung University, TW  
**Authors:** Ting-Wu Chin, Shiao-Li Tsao, Kuo-Wei Hung and Pei-Shu Huang, National Chiao Tung University, TW  
**Abstract**  
Previous studies have used on-chip thermal sensors (diodes) to estimate the leakage power of a CPU. However, an embedded CPU is equipped with only a few thermal sensors and may suffer from considerable spatial temperature variation across the CPU core, so leakage power estimation based on insufficient temperature information introduces errors. According to our experiments, conventional leakage power models may have up to 22.9% estimation error for a 70-nm embedded CPU. In this study, we first evaluated the accuracy of leakage power estimates based on thermal sensors at different locations of a CPU and suggested locations that can reduce the error to 0.9%. Then, we proposed temperature-referred and counter-tracked estimation (TRACE), which relies on temperature sensors and hardware activity counters to estimate leakage power. The simulation results demonstrated that employing TRACE could reduce the error to 3.4%. Experiments were also conducted on a real platform to verify our findings.

**Download Paper (PDF; Only available from the DATE venue WiFi)**

LIFETIME ADAPTIVE ECC IN NAND FLASH PAGE MANAGEMENT

Speaker: Shunhuo Wang, Huazhong University of Science and Technology, CN
Authors: Shunhuo Wang 1, Fei Wu 1, Zhonghai Lu 2, You Zhou 1, Qin Xiong 1, Meng Zhang 1 and Changsheng Xie 1
Abstract: With increasing density, NAND flash memory has decreasing reliability. Furthermore, the raw bit error rate (RBER) of flash memory grows at an exponential rate as the number of program/erase (P/E) cycles increases. Thus, error correction codes (ECCs), usually stored in the out-of-band area (OOB) of flash pages, are widely employed to ensure reliability. However, the worst-case oriented ECC is largely under-utilized in the early stage, i.e. when the P/E cycle count is small, and the required ECC redundancy may be too large to be stored in the OOB. In this paper, we propose LAE-FTL, which employs a lifetime-adaptive ECC scheme, to improve the performance and lifetime of NAND flash memory. In the early stage, weak ECCs can guarantee reliability and the OOB is large enough to store the ECCs. Thus, LAE-FTL employs weak ECCs and adaptively uses small and incremental codewords as the P/E cycle count increases to improve data transfer and decoding parallelism. In the late stage with large P/E cycle counts, strong ECCs are needed and the ECC redundancies become too large to fit in the OOB.

In this paper, we present a schedule-aware optimization approach enabled by exploiting the parallel slack existing in loops parallelized with OpenMP. Results on an embedded multicore device show that the performance achieved by OpenMP loops optimized with our approach exceeds, by up to 93%, the performance achieved by the original OpenMP loops, where the scheduling policy is not specified.

Download Paper (PDF; Only available from the DATE venue WiFi)

---

### A SCHEDULABILITY TEST FOR SOFTWARE MIGRATION ON MULTICORE SYSTEMS

**Speaker:** Jung-Eun Kim, Department of Computer Science at the University of Illinois at Urbana-Champaign, US

**Authors:** Jung-Eun Kim1, Richard Bradford2, Tarek Abdelzaher3 and Lui Sha3

1Department of Computer Science, University of Illinois at Urbana-Champaign, US; 2Rockwell Collins, Cedar Rapids, IA, US; 3University of Illinois, US

**Abstract**

This paper presents a new schedulability test for safety-critical software undergoing a transition from single-core to multicore systems – a challenge faced by many industries today. Our migration model consists of a schedulability test and execution model. Its properties enable us to obtain a utilization bound that places an allowable limit on total task execution times. Evaluation results demonstrate the advantages of our scheduling model over competing resource partitioning approaches, such as Periodic Server and TDM.
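
For readers unfamiliar with utilization-bound tests, the sketch below shows the general shape of one. It uses the classic Liu and Layland rate-monotonic bound for a single core as a stand-in; it is not the bound derived in this paper, and the task set is invented for illustration.

```python
def total_utilization(tasks):
    """Sum of execution_time / period over all tasks."""
    return sum(c / t for c, t in tasks)

def liu_layland_bound(n):
    """Classic rate-monotonic bound n * (2^(1/n) - 1) for n periodic tasks."""
    return n * (2 ** (1.0 / n) - 1)

def passes_utilization_test(tasks):
    """Sufficient (not necessary) test: utilization at or below the bound."""
    return total_utilization(tasks) <= liu_layland_bound(len(tasks))

# Hypothetical task set: (worst-case execution time, period), same time unit
tasks = [(1.0, 4.0), (1.0, 5.0), (2.0, 10.0)]
print(total_utilization(tasks), liu_layland_bound(len(tasks)),
      passes_utilization_test(tasks))
```

A test of this kind is attractive for migration because it needs only execution times and periods; a task set that fails it is not necessarily unschedulable, it simply requires a more detailed analysis.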

**Download Paper (PDF; Only available from the DATE venue WiFi)**

---

### ADAPTIVE POWER DELIVERY SYSTEM MANAGEMENT FOR MANY-CORE PROCESSORS WITH ON/OFF-CHIP VOLTAGE REGULATORS

**Speaker:** Haoran Li, The Hong Kong University of Science and Technology, HK

**Authors:** Haoran Li, Jiang Xu, Zhe Wang, Peng Yang, Rafael Kioji Vivas Maeda and Zhongyuan Tian, The Hong Kong University of Science and Technology, HK

**Abstract**

The power delivery system (PDS) plays a crucial role in guaranteeing the proper functionality of many-core processors. However, as the PDS is usually optimized to provide power to the target chip at its best performance level, its energy efficiency can be seriously degraded under highly dynamic workloads, making it a major source of system power losses. On-chip voltage regulators (VRs), which are able to achieve fast and fine-grained power control, have been popular choices for PDS implementation and have provided design opportunities for improving system energy efficiency. In this paper, we propose the adaptive Quantized Power Management (QPM) scheme to dynamically adjust the PDS with both on-chip and off-chip VRs based on run-time workloads. Experimental results on different applications show that QPM applied on a hybrid PDS with both on-chip and off-chip VRs achieves 74.1% average overall energy efficiency, 12.3% higher than the conventional PDS with a single off-chip VR.

**Download Paper (PDF; Only available from the DATE venue WiFi)**

---

### FLYING AND DECOUPLING CAPACITANCE OPTIMIZATION FOR AREA-CONSTRAINED ON-CHIP SWITCHED-CAPACITOR VOLTAGE REGULATORS

**Speaker:** Xiaoyang Mi, Arizona State University, US

**Authors:** Xiaoyang Mi1, Hesam Fathi Moghadam2 and Jae-sun Seo1

1Arizona State University, US; 2Oracle Corporation, US

**Abstract**

Switched-capacitor voltage regulators (SCVRs) are widely used in on-chip power management, due to high step-down efficiency and feasibility of integration. In this work, we present theoretical analysis and optimization methodology for flying and decoupling capacitance values for area-constrained on-chip SCVRs to achieve the highest system-level power efficiency. The proposed models for efficiency and droop voltage are validated with on-chip 2:1 SCVR implementations in both 65nm and 32nm CMOS, which show high model accuracy. The maximum and average error of the predicted optimal ratio between flying and decoupling capacitance are 5% and 1.7%, respectively.

**Download Paper (PDF; Only available from the DATE venue WiFi)**

---

### ENHANCING ANALOG YIELD OPTIMIZATION FOR VARIATION-AWARE CIRCUITS SIZING

**Speaker:** Ons Lahouel, Concordia University, CA

**Authors:** Ons Lahouel, Mohamed H. Zaki and Sofiene Tahar, Concordia University, CA

**Abstract**

This paper presents a novel approach for improving automated analog yield optimization using a two-step exploration strategy. First, a global optimization phase relies on a modified Lipschitzian optimization to sample the potential optimal sub-regions of the feasible design space. The search locates a design point near the optimal solution that is used as a starting point by a local optimization phase. The local search constructs linear interpolating surrogate models of the yield to explore the basin of convergence and to rapidly reach the global optimum. Experimental results show that our approach locates higher quality design points in terms of yield rate in less run time and without affecting the accuracy.

**Download Paper (PDF; Only available from the DATE venue WiFi)**

---

### A NEW SAMPLING TECHNIQUE FOR MONTE CARLO-BASED STATISTICAL CIRCUIT ANALYSIS

**Speaker:** Hiwa Mahmoudi, Vienna University of Technology, AT

**Authors:** Hiwa Mahmoudi and Horst Zimmermann, Vienna University of Technology, AT

**Abstract**

Variability is a fundamental issue which gets exponentially worse as CMOS technology shrinks. Therefore, characterization of statistical variations has become an important part of the design phase. Monte Carlo-based simulation is a standard technique for statistical analysis and modeling of integrated circuits. However, crude Monte Carlo sampling based on pseudorandom selection of parameter variations suffers from low convergence rates, so achieving high accuracy is computationally expensive. In this work, we present an extensive study on the performance of two widely used techniques, Latin Hypercube and Low Discrepancy sampling methods, and compare their speed-up and accuracy. It is shown that these methods can be more efficient than pseudorandom sampling, but only in limited applications. Therefore, we propose a new sampling scheme that exploits the benefits of both methods by combining them. Through representative circuit examples, it is shown that the proposed sampling technique provides a major improvement in terms of computational effort and offers better properties than either method alone.
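To make the contrast between crude pseudorandom sampling and Latin Hypercube sampling concrete, here is a minimal standard-library sketch. It shows only the textbook Latin Hypercube construction (one sample per equal-width stratum in every dimension); it does not reproduce the combined scheme proposed in the paper, and the function names are assumptions for illustration.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims such that, in every
    dimension, each of the n_samples equal-width strata holds one point."""
    columns = []
    for _ in range(n_dims):
        # One uniformly placed value inside each stratum [i/n, (i+1)/n)
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so that strata are paired randomly across dimensions
        rng.shuffle(column)
        columns.append(column)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


def pseudorandom(n_samples, n_dims, rng):
    """Crude Monte Carlo: independent uniform draws, no stratification."""
    return [tuple(rng.random() for _ in range(n_dims))
            for _ in range(n_samples)]


if __name__ == "__main__":
    rng = random.Random(0)
    for point in latin_hypercube(5, 2, rng):
        print(point)
```

Because every stratum is covered exactly once per dimension, the marginal distribution of each parameter is sampled evenly even for small sample counts, which is where the variance reduction over crude sampling comes from.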

**Download Paper (PDF; Only available from the DATE venue WiFi)**
IP4-15

AUTOMATIC TECHNOLOGY MIGRATION OF ANALOG IC DESIGNS USING GENERIC CELL LIBRARIES

Speaker:
Nuno Horta, Instituto de Telecomunicações / Instituto Superior Técnico, PT

Authors:
Jose Cachaco1, Nuno Machado1, Nuno Lourenco1, Jorge Guilherme2 and Nuno Horta3
1Instituto de Telecomunicacoes/Instituto Superior Tecnico, PT; 2Instituto de Telecomunicacoes/Instituto Politecnico de Tomar, PT; 3Instituto de Telecomunicacoes/Instituto Superior Técnico, PT

Abstract
This paper addresses the problem of automatic technology migration of analog IC designs. The proposed approach introduces a new level of abstraction for EDA tools addressing analog IC design, allowing a systematic and effortless adaptation of a design to a new technology. The new abstraction level is based on generic cell libraries, which include topology and testbench descriptions for specific circuit classes. The new approach is implemented and tested using a state-of-the-art multi-objective multi-constraint circuit-level optimization tool, and is validated for the sizing and optimization of continuous-time comparators, including technology migration between two different design nodes, namely XFAB 350 nm technology (XH035) and ATMEL 150 nm SOI technology (AT77K).

Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-16

NOISE-SENSITIVE FEEDBACK LOOP IDENTIFICATION IN LINEAR TIME-VARYING ANALOG CIRCUITS

Speaker:
Peng Li, Texas A&M University, US

Authors:
Ang Li1, Peng Li1, Tingwen Huang2 and Edgar Sánchez-Sinencio1
1Texas A&M University, US; 2Texas A&M University at Qatar, QA

Abstract
The continuing scaling of VLSI technology and design complexity has rendered robustness of analog circuits a significant concern. Parasitic effects may introduce unexpected marginal instability within multiple noise-sensitive loops and hence jeopardize circuit operation and processing precision. The Loop Finder algorithm has been recently proposed to allow detection of noise-sensitive return loops for circuits that are described using a linear time-invariant (LTI) system model. However, many practical circuits such as switched-capacitor filters and mixers present time-varying behaviors which are intrinsically coupled with noise propagation and introduce new noise generation mechanisms. For the first time, we take an in-depth look into the marginal instability of linear periodically time-varying (LPTV) analog circuits and further develop an algorithm for efficient identification of noise-sensitive loops, unifying the solution to noise sensitivity analysis for both LTI and LPTV circuits.

Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-17

CANDY-TM: COMPARATIVE ANALYSIS OF DYNAMIC THERMAL MANAGEMENT IN MANY-CORES USING MODEL CHECKING

Speaker:
Muhammad Shafique, Institute of Computer Engineering, Vienna University of Technology (TU Wien), AT

Authors:
Syed Ali Asadullah Bukhari1, Faiz Khalid Lodhi1, Osman Hasani1, Muhammad Shafique1 and Joerg Henkel4
1National University of Sciences and Technology - School of Electrical Engineering and Computer Science, PK; 2School of Electrical Engineering and Computer Science National University of Sciences and Technology (NUST), PK; 3Vienna University of Technology (TU Wien), AT; 4Karlsruhe Institute of Technology, DE

Abstract
Dynamic thermal management (DTM) techniques based on task migration provide a promising solution to mitigate thermal emergencies and thereby ensure safe operation and reliability of many-core systems. These techniques can be classified as central or distributed on the basis of a central DTM controller for the whole system or individual DTM controllers for each core or set of cores in the system, respectively. However, making a trustworthy choice between central (c-) and distributed (d-) DTM techniques, to find the most suitable one for a given system, is quite challenging. This is primarily due to the systemic difference between cDTM and dDTM controllers, and the inherent non-exhaustiveness of simulation and emulation methods conventionally used for DTM analysis. In this paper, we present a novel methodology called CAnDy-TM (Comparative Analysis of Dynamic Thermal Management) that employs model checking to perform formal comparative analysis of cDTM and dDTM techniques. We identify a set of generic functional and performance properties to provide a common ground for their comparison. We demonstrate the usability and benefits of our methodology by comparing state-of-the-art cDTM and dDTM techniques, and illustrate which technique performs better with respect to thermal stability and other task migration parameters. Such an analysis helps in selecting the most appropriate DTM for a given chip.

Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-18

POWER PRE-CHARACTERIZED MESHING ALGORITHM FOR FINITE ELEMENT THERMAL ANALYSIS OF INTEGRATED CIRCUITS

Speaker:
Shohdy Abdelkader, Software Developer, EG

Authors:
Shohdy Abdelkader1, Alaa ElRouby1 and Mohamed Dessouky1
1Mentor, EG; 2Electric and Electronic Department, Faculty of Engineering and Natural Science, Yildirim Beyazit University, TR

Abstract
In this paper we present an adaptive meshing technique suitable for steady state finite element (FE) based thermal analysis of integrated circuits (ICs). The algorithm presented is a non-iterative one where the technology used is first pre-characterized ...

Download Paper (PDF; Only available from the DATE venue WiFi)

**UB09.8 TIDES: NON-LINEAR WAVEFORMS FOR QUICK TRACE NAVIGATION**

**Presenter:** Jannis Stoppe, University of Bremen, DE  
**Author:** Ralf Drechsler, University of Bremen / DFKI, DE  
**Abstract**  
System trace analysis is mostly done using waveform viewers -- tools that relate signals and their assignments at certain times. While generic hardware design is subject to some innovative visualisation ideas and software visualisation has been a research topic for much longer, these classic tools have been part of the design process since the earlier days of hardware design -- and have not changed much over the decades. Instead, the currently available programs have evolved to look practically the same, all following a familiar pattern that has not changed since their initial appearance. We argue that there is still room for innovation beyond the very classic waveform display though. We implemented a proof-of-concept waveform viewer (codenamed Tides) that has several unique features that go beyond the standard set of features for waveform viewers.

**More information ...**

**UB09.9 HEPSYCODE: A SYSTEM-LEVEL METHODOLOGY FOR HW/SW CO-DESIGN OF HETEROGENEOUS PARALLEL DEDICATED SYSTEMS**

**Presenter:** Luigi Pomante, University of L'Aquila, IT  
**Authors:** Giacomo Valente 1, Vittoriano Muttillo 1, Daniele Di Pompeo 1, Emilico Incerto 1 and Daniele Ciambrotte 1  
1University of L'Aquila, IT; Gran Sasso Science Institute, IT  
**Abstract**  
Heterogeneous parallel systems have recently been exploited for a wide range of application domains, for both dedicated (e.g. embedded) and general purpose products. Such systems can include different processor cores, memories, dedicated ICs and a set of connections between them. They are so complex that the design methodology plays a major role in determining the success of the products. This demo therefore addresses the problem of electronic system-level HW/SW co-design of heterogeneous parallel dedicated systems. In particular, it shows an enhanced CSP/SystemC-based design space exploration step (and related ESL-EDA prototype tools), in the context of an existing HW/SW co-design flow that, given the system specification and related F/NF requirements, is able to (semi)automatically propose to the designer a custom heterogeneous parallel architecture, an HW/SW partitioning of the application, and a mapping of the partitioned entities onto the proposed architecture.

**More information ...**

**UB09.10 PULP: A ULTRA-LOW POWER PLATFORM FOR THE INTERNET-OF-THINGS**

**Presenter:** Francesco Conti, ETH Zurich, CH  
**Authors:** Stefano Mach 1, Florian Zaruba 1, Antonio Pullini 1, Daniele Palossi 1, Giovanni Rovere 1, Florian Glaser 1, Germain Haugou 1, Schekeb Fateh 1 and Luca Benini 1  
1ETH Zurich, CH; 2ETH Zurich, CH and University of Bologna, IT  
**Abstract**  
The PULP (Parallel Ultra-Low Power) platform strives to provide high performance for IoT nodes and endpoints within a very small power envelope. The PULP platform is based on a tightly-coupled multi-core cluster and on a modular architecture, which can support complex configurations with autonomous I/O without SW intervention, HW-accelerated execution of hot computation kernels, and fine-grain event-based computation, but can also be deployed in very simple configurations, such as the open source PULPino microcontroller. In this demonstration booth, we will showcase several prototypes using PULP chips in various configurations. Our prototypes perform demos such as real-time deep-learning based visual recognition from a low-power camera, and online biosignal acquisition and reconstruction on the same chip. Application scenarios for our technology include healthcare wearables, autonomous nano-UAVs, and smart networked environmental sensors.

**More information ...**

---

**10.1 Wearable and Smart Medical Devices Day: Diagnosis and prevention systems**

**Date:** Thursday 30 March 2017  
**Time:** 11:00 - 12:30  
**Location / Room:** SBC

**Organisers:**  
José L. Ayala, Universidad Complutense de Madrid, ES  
Chris Van Hoof, IMEC, BE  

**Chair:** Olivier Romain, Université de Cergy-Pontoise, FR  
**Co-Chair:** Mario Konijnenburg, IMEC, BE  

This session will present novel approaches, techniques and devices for the improvement of diagnosis and prevention systems. Improved bioanalytics-on-chip designs, wearables supporting the elderly, computational mechanisms for the prevention of symptoms, and bioelectronic medicines will be covered.

**Time | Label | Presentation Title**
--- | --- | ---
11:00 | 10.1.1 | **ENABLING TECHNOLOGIES FOR NEXT GENERATION BIOANALYTICS ON CHIP**

**Authors:** Carlota Guiducci, EPFL, CH  
**Abstract**  
The adoption of lab-on-chip based solutions in clinical practice and in the framework of the most common bioanalytics protocols has long been sought for the possibility to finely control the movement of fluids and the flow of molecules and particles. Nevertheless, the existing solutions inherently limit both throughput and the possibility to sense and manipulate single particles. A few years ago, we undertook a major challenge in this context, starting from the consideration that the lack of solutions to localize electric fields in micro-regions and to control their distribution over the height of the chambers fundamentally limited the efficiency and the scalability of these systems. Our strategy, based on a monolithic process, results in highly conductive and individually addressable vertical microelectrodes, fully integrated in high aspect-ratio microfluidics. We have applied this novel process to develop a new generation of microfluidic flow cytometers that could successfully detect, for the first time, activated T lymphocytes in a cellular sample. In this talk we will also describe our contribution to the integration of biosensors on IC layers and to solving the issues related to the specific surface treatments involved in the analytical protocol.
<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>11:20</td>
<td>10.1.2</td>
<td>BIOELECTRONICS MEDICINES - BRIDGING BIOLOGY WITH TECHNOLOGY</td>
<td>Firat Yazicioglu, GSK, BE</td>
</tr>
<tr>
<td>11:45</td>
<td>10.1.3</td>
<td>AN OPTIMAL APPROACH FOR LOW-POWER MIGRAINE PREDICTION MODELS IN THE STATE-OF-THE-ART WIRELESS MONITORING DEVICES</td>
<td>Josué Pagán, Universidad Complutense de Madrid, ES</td>
</tr>
</tbody>
</table>

**Abstract**

Wearable monitoring devices for ubiquitous health care are becoming a reality that has to deal with limited battery autonomy. Several researchers focus their efforts on reducing the energy consumption of these motes, from efficient micro-architectures to on-node data processing techniques. In this paper we focus on optimizing the energy consumption of monitoring devices for the real-time prediction of symptomatic events in chronic diseases. To do this, we have developed an optimization methodology that incorporates information from several sources of energy consumption: the running code for prediction, and the sensors for data acquisition. As a result of our methodology, we are able to reduce the energy consumption of the computing process by up to 90% with a minimal impact on accuracy. The proposed optimization methodology can be applied to any prediction modeling scheme to introduce the concept of energy efficiency. In this work we test the framework using Grammatical Evolutionary algorithms in the prediction of chronic migraines.

[Download Paper (PDF; Only available from the DATE venue WiFi)]

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>12:05</td>
<td>10.1.4</td>
<td>WEARABLE ELECTRONICS - WHAT IS IT GOOD FOR - AND WHAT IS MISSING TO SUPPORT THE QUALITY OF LIFE OF ELDERLY PEOPLE?</td>
<td>Ralf Brederlow, Kilby Labs at Texas Instruments, DE</td>
</tr>
</tbody>
</table>

10.3 Side-Channel Attacks

Date: Thursday 30 March 2017
Time: 11:00 - 12:30
Location / Room: 2BC
Chair: Oscar Reparaz, Katholieke Universiteit Leuven, BE
Co-Chair: Wieland Fischer, Infineon Technologies, DE

This session introduces new side-channel attack techniques against cryptographic primitives, namely leakage-resilient protocols and storage encryption based on AES. A power measurement setup specifically targeting static power consumption is also presented and evaluated from the side-channel attack viewpoint.

10.3.1 SIDE-CHANNEL PLAINTEXT-RECOVERY ATTACKS ON LEAKAGE-RESILIENT ENCRYPTION

Speaker: Thomas Unterluggauer, Graz University of Technology, AT
Authors: Thomas Unterluggauer, Mario Werner and Stefan Mangard, Graz University of Technology, AT

Abstract
Differential power analysis (DPA) is a powerful tool to extract the key of a cryptographic implementation from observing its power consumption during the encryption/decryption of many different inputs. Therefore, cryptographic schemes based on frequent re-keying such as leakage-resilient encryption aim to inherently prevent DPA on the secret key by limiting the amount of data being processed under one key. However, the original asset of encryption, namely the plaintext, is disregarded. This paper builds on this observation and studies how the re-keying countermeasure does not only protect the secret key, but also induces another DPA vulnerability that allows for plaintext recovery. Namely, the frequent re-keying in leakage-resilient streaming modes causes constant plaintexts to be attackable through first-order DPA. Similarly, constant plaintexts can be revealed from re-keyed block ciphers using templates in a second-order DPA. Such plaintext recovery is particularly critical whenever long-term key material is encrypted and thus leaked. Besides leakage-resilient encryption, the presented attacks are also relevant for a wide range of other applications in practice that implicitly use re-keying, such as multi-party communication and memory encryption with random initialization for the key. Practical evaluations on both an FPGA and a microcontroller support the feasibility of the attacks and thus suggest the use of cryptographic implementations protected by mechanisms like masking in scenarios that require data encryption with multiple keys.
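The core observation, that a constant plaintext encrypted under ever-changing keys can itself become the target of a first-order attack, can be illustrated with a small simulation. The sketch below is a simplified model and not the paper's experimental setup: it assumes a stream mode where each ciphertext byte is the plaintext XORed with a fresh key-stream byte, that the attacker sees the ciphertext, and that the device leaks the Hamming weight of the key-stream byte plus Gaussian noise. All function names are hypothetical.

```python
import random


def hamming_weight(x: int) -> int:
    return bin(x).count("1")


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def simulate_traces(plaintext_byte, n_traces, noise_sigma, rng):
    """Constant plaintext, fresh key-stream byte per trace (re-keying)."""
    ciphertexts, leakages = [], []
    for _ in range(n_traces):
        key_stream = rng.randrange(256)
        ciphertexts.append(plaintext_byte ^ key_stream)
        # Assumed leakage model: Hamming weight of the key stream + noise
        leakages.append(hamming_weight(key_stream)
                        + rng.gauss(0.0, noise_sigma))
    return ciphertexts, leakages


def recover_plaintext_byte(ciphertexts, leakages) -> int:
    """For each plaintext guess, predict the key stream as c XOR guess
    and keep the guess whose prediction correlates best with the leakage."""
    def score(guess):
        predicted = [hamming_weight(c ^ guess) for c in ciphertexts]
        return pearson(predicted, leakages)
    return max(range(256), key=score)


if __name__ == "__main__":
    rng = random.Random(1)
    cts, leaks = simulate_traces(0xA7, 2000, 1.0, rng)
    print(hex(recover_plaintext_byte(cts, leaks)))
```

In this model the roles of key and data are swapped relative to classic DPA: the key material varies from trace to trace while the secret plaintext stays fixed, which is exactly the condition a correlation attack needs.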

Download Paper (PDF; Only available from the DATE venue WiFi)

12:07 10.2.4 DESIGN AUTOMATION FOR QUANTUM ARCHITECTURES

Speaker: Martin Roetteler, Microsoft Research, US
Authors: Martin Roetteler, Krysta M. Svore, Dave Wecker and Nathan Wiebe, Microsoft, US

Abstract
We survey recent strides made towards building a software framework that is capable of compiling quantum algorithms from a high-level description down to physical gates that can be implemented on a fault-tolerant quantum computer. We discuss why compilation and design automation tools such as the ones in our framework are key for tackling the grand challenge of building a scalable quantum computer. We then describe specialized libraries that have been developed using the LIQUID programming language. These include reversible circuits for arithmetic as well as new, truly quantum approaches that rely on quantum computer architectures that allow the probabilistic execution of gates, a model that can reduce time and space overheads in some cases. We highlight why these libraries are useful for the implementation of many quantum algorithms. Finally, we survey the tool REVs, which facilitates resource-efficient compilation of higher-level irreversible programs into lower-level reversible circuits while trying to optimize the memory footprint of the resulting reversible networks. This is motivated by the limited availability of qubits for the foreseeable future.
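As a small illustration of what compiling an irreversible operation into a reversible circuit means, the sketch below embeds the irreversible AND function into the reversible Toffoli gate using one extra (ancilla) bit. This is a textbook construction acting on classical bits, not code from LIQUID or REVs; the function names are assumptions.

```python
from itertools import product


def toffoli(a: int, b: int, c: int):
    """Toffoli (controlled-controlled-NOT) on classical bits:
    flips the target c exactly when both controls a and b are 1."""
    return a, b, c ^ (a & b)


def reversible_and(a: int, b: int) -> int:
    """AND computed reversibly: the result lands on an ancilla bit
    initialised to 0, and the inputs are preserved."""
    _, _, out = toffoli(a, b, 0)
    return out


if __name__ == "__main__":
    for bits in product((0, 1), repeat=3):
        print(bits, "->", toffoli(*bits))
```

Each irreversible operation embedded this way costs an ancilla bit, which is why a compiler that minimises the memory footprint of the reversible network matters when qubits are scarce.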

Download Paper (PDF; Only available from the DATE venue WiFi)
In this paper we present a new formal model, called p-FSM, for system-level power management design. The p-FSM is a modular, compositional, hierarchical, and unified model for hardware and software components. The model encapsulates power management control mechanisms, operating states and properties of a component that affect power, energy and thermal aspects of the system. Inter-component dependencies are modeled through a component-based interface. By connecting multiple p-FSMs we gradually compose the model of the whole system which ensures correct-by-construction system-level control sequencing. The model can also be used to formally verify the functional correctness of the power management design.

Download Paper (PDF; Only available from the DATE venue WiFi)
<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Speaker</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>11:00</td>
<td>10.4.1</td>
<td><strong>A FIELD PROGRAMMABLE TRANSISTOR ARRAY FEATURING SINGLE-CYCLE PARTIAL/FULL DYNAMIC RECONFIGURATION</strong></td>
<td>Carl Sechen, The University of Texas at Dallas, US</td>
<td>Jingxiang Tian, Gaurav Rajavendra Reddy, Jiajia Wang, William Swartz Jr., Yorgos Makris and Carl Sechen, The University of Texas at Dallas, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>We introduce a CMOS computational fabric consisting of carefully arranged regular rows and columns of transistors which can be individually configured and appropriately interconnected in order to implement a target digital circuit. Termed Field Programmable Transistor Array (FPTA), this novel reconfigurable architecture enables several highly-desirable features including (i) simultaneous storage of three configurations along with the ability to dynamically switch between them within a single cycle, while retaining the fabric's computational state, (ii) rapid partial or full modification of a stored configuration in a time proportional to the number of modified configuration bits through the use of hierarchically-arranged, high-throughput, asynchronously pipelined memory buffers, and (iii) support for libraries containing cells of the same height and variable width, just as in a typical standard cell circuit, thereby simplifying transition from a prototype to a custom IC design. Besides presenting the design details of this fabric in a 130nm technology and demonstrating the aforementioned capabilities, we also briefly discuss the development of a complete CAD flow for programming this fabric and we use numerous benchmark circuits to contrast its area efficiency against a typical FPGA implemented in the same technology node.</td>
<td></td>
<td></td>
</tr>
<tr>
<td>11:30</td>
<td>10.4.2</td>
<td><strong>A POWER GATING SWITCH BOX ARCHITECTURE IN ROUTING NETWORK OF SRAM-BASED FPGAS IN DARK SILICON ERA</strong></td>
<td>Hossein Asadi, Sharif University of Technology, IR</td>
<td>Zeinab Seifoori, Behnam Khaleghi and Hossein Asadi, Sharif University of Technology, IR</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Continuous down-scaling of CMOS technology in recent years has resulted in an exponential increase in static power consumption, which acts as a power wall for further transistor integration. One promising approach to throttle the substantial static power of Field-Programmable Gate Arrays (FPGAs) is to power off unused routing resources such as switch boxes, known as dark silicon. In this paper, we present a Power gating Switch Box Architecture (PESA) for the routing network of SRAM-based FPGAs to overcome the obstacle to further device integration. In the proposed architecture, by exploring various patterns of used multiplexers in switch boxes, we employ a configurable controller to turn off unused resources in the routing network. Our study shows that due to the significant percentage of unused switches in the routing network, PESA is able to considerably improve power efficiency in SRAM-based FPGAs. Experimental results carried out on different benchmarks using the VPR toolset show that PESA decreases power consumption of the routing network by up to 75% compared to conventional architectures while keeping performance intact.</td>
<td></td>
<td></td>
</tr>
<tr>
<td>12:00</td>
<td>10.4.3</td>
<td><strong>A STATIC-PLACEMENT, DYNAMIC-ISSUE FRAMEWORK FOR CGRA LOOP ACCELERATOR</strong></td>
<td>Zhongyuan Zhao, Department of NaNo/Micro Electronics, CN</td>
<td>Zhongyuan Zhao, Weiguang Sheng, Wei Feng He, Zhi Gang Mao, and Zhaoshi Li</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>This paper presents a static-placement, dynamic-issue (SPDI) framework for the coarse-grained reconfigurable architecture (CGRA) in order to tackle the inefficiencies of the static-issue, static-placement (SISP) CGRA. This framework includes a compiler that statically places the operations and a hardware design, the SPDI CGRA, that automatically schedules the operations. We focus on introducing the SPDI CGRA in this paper. This newly designed hardware model adds the token buffer, which is capable of automatically scheduling the operations inside processing elements (PEs), along with a router network that can effectively transform and control data flow among the PE array. This design lets the hardware share responsibility with the compiler, making them cooperate to deal with the issuing, placement and routing problems. Evaluation of our study shows that our framework can reach on average 1.28, 1.30 and 1.33 higher than three state-of-the-art SISP CGRAs using the REGImap, RS compile flow and EPIMap approaches, respectively. The area overhead is nearly 0.93% per token buffer entry for each PE relative to the SISP CGRA.</td>
<td></td>
<td></td>
</tr>
<tr>
<td>12:30</td>
<td></td>
<td>End of session</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Lunch Break in Garden Foyer</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Keynote Lecture session 11.0</strong> in &quot;Garden Foyer&quot; 13:20 - 13:50</td>
<td></td>
<td></td>
</tr>
<tr>
<td>10.5 Emerging NoC Directions</td>
<td></td>
<td><strong>Date:</strong> Thursday 30 March 2017</td>
<td><strong>Time:</strong> 11:00 - 12:30</td>
<td><strong>Location / Room:</strong> 3C</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Chair:</strong></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Jiang Xu, Hong Kong University of Science and Technology, HK</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Co-Chair:</strong></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Tushar Krishna, GeorgiaTech, US</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>This session presents papers on emerging directions in NoC design. The first paper uses machine learning for effective power management in NoCs. The next three papers use emerging technologies - wireless, 3D, and Optical - for efficient on-chip communications.</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
MACHINE LEARNING ENABLED POWER-AWARE NETWORK-ON-CHIP DESIGN

Speaker: Avinash Kodli, Ohio University, US
Authors: Dominic DiTomaso, Ashif Sikder, Avinash Kodli, and Ahmed Louri

Abstract: Although Networks-on-Chip (NoCs) are fast becoming pervasive as the interconnect fabric for multicore architectures and systems-on-chip, they still suffer from excessive static and dynamic power consumption. High dynamic power consumption results from switching and storing data within routers/links, while excessive static power is consumed when routers and links are not utilized for communication and yet have to be powered up. In this paper, we propose LESSON (Learning Enabled Sleepy Storage Links and Routers in NoCs) to reduce both static and dynamic power consumption by power-gating the links and routers at low network utilization and moving the data storage from within the routers to the links at high network utilization. As the network utilization increases from low to high, to accommodate more traffic, we design the same channels to carry traffic in either direction, thereby avoiding complex routing or look-ahead wake-up algorithms. Machine learning algorithms predict when to power-gate the channels and routers and when to increase the channel bandwidths such that power savings are maximized while the performance penalty is minimized. Our results show that we can reduce total network power consumption by 85.6% compared to conventional NoC buffer designs and by 31.7% compared to aggressive NoC buffer designs. Our predictor shows marginal performance penalties and, by dynamically changing the direction of the links, we can improve packet latency by 14%.

Download Paper (PDF; Only available from the DATE venue WiFi)

PERFORMANCE EVALUATION AND DESIGN TRADE-OFFS FOR WIRELESS-ENABLED SMART NOC

Speaker: Karthi Duraisamy, Washington State University, US
Authors: Karthi Duraisamy and Partha Pande, Washington State University, US

Abstract: SMART (Single-Cycle Multi-hop Asynchronous Repeated Traversal) NoC architectures enable single-cycle data transfers, even between physically distant nodes. However, enabling single-cycle hops over long distances restricts the achievable clock frequency of the system. In other words, increasing the NoC clock frequency lowers the number of hops that can be traversed in a single cycle in a conventional SMART NoC. In this work, we demonstrate that by integrating wireless links and a novel look-ahead request mechanism in the SMART NoC, it is possible to enable low-latency and energy-efficient data transfers, even when the system is designed with high clock frequencies. For the various applications considered in this work, our wireless-enabled SMART (WiSMART) NoC achieves on average a 33% reduction in message latency compared to the wireline SMART mesh NoC. This network-level improvement translates into 16% savings in full-system energy-delay product.
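The trade-off between clock frequency and single-cycle reach can be made concrete with a back-of-the-envelope model. The sketch below assumes that a flit needs a fixed wire-plus-bypass delay per hop and that all hops must fit in one clock period; the per-hop delay value used in the demo is hypothetical and not taken from the paper.

```python
def max_hops_per_cycle(clock_mhz: int, hop_delay_ps: int) -> int:
    """Number of hops that fit into one clock period.

    clock_mhz: NoC clock frequency in MHz.
    hop_delay_ps: assumed delay of one hop (link + bypass) in picoseconds.
    """
    period_ps = 1_000_000 // clock_mhz  # 1 us = 1,000,000 ps
    return period_ps // hop_delay_ps


if __name__ == "__main__":
    # Hypothetical 125 ps per hop, swept over several clock frequencies
    for mhz in (500, 1000, 2000, 4000):
        print(mhz, "MHz ->", max_hops_per_cycle(mhz, 125), "hops")
```

Under this model, doubling the clock frequency halves the number of hops reachable in one cycle, which is the limitation the wireless links are meant to sidestep.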

Download Paper (PDF; Only available from the DATE venue WiFi)

ROBUST TSV-BASED 3D NOC DESIGN TO COUNTERACT ELECTROMIGRATION AND CROSSTALK NOISE

Speaker: Partha Pande, Washington State University, US
Authors: Sourav Das, Janardhan Rao Doppa, Partha Pande, and Krishnendu Chakrabarty

Abstract: A 3D network-on-chip (3D NoC) is an enabler for the design of high-performance and energy-efficient manycore chips. Most popular 3D NoCs utilize Through-Silicon-Via (TSV)-based vertical links (VLs) as the communication pillars between the planar dies. However, the TSVs in a 3D NoC may fail due to both workload-induced stress and crosstalk capacitance. This failure negatively affects the overall achievable performance of the 3D NoC. In this work, we analyze the joint effects of workload-induced stress and crosstalk on the TSVs. If only the TSV failures due to workload-induced stress are considered, then the estimated MTTF and the resulting lifetime of the 3D NoC are too optimistic. Due to the combined effects of workload and crosstalk noise, the lifetime of the 3D NoC reduces significantly. Subsequently, we demonstrate that a spare TSV allocation methodology considering the joint effects of workload and crosstalk noise enhances the lifetime of the 3D NoC by a factor of 4.6 compared to when only the workload is considered for a given spare budget of 5%.

Download Paper (PDF; Only available from the DATE venue WiFi)

PERFORMANCE AND ENERGY AWARE WAVELENGTH ALLOCATION ON RING-BASED WDM 3D OPTICAL NOC

Speaker: Jiating Luo, INRIA/IRISA, FR
Authors: Jiating Luo, Ashraf Elantably, Pham Van Dung, Cedric Killian, Daniel Chillet, Sébastien Le Beux, Olivier Sentieys, and Ian O’Connor

Abstract: Optical Network-on-Chip (ONoC) is a promising communication medium for large-scale Multiprocessor System on Chip (MPSoC). ONoC outperforms classical electrical NoC in terms of throughput and latency. The medium can support multiple transactions at the same time on different wavelengths by using Wavelength Division Multiplexing (WDM). Moreover, multiple wavelengths can be used as a high-bandwidth channel to reduce transmission time. However, multiple signals simultaneously sharing a waveguide can lead to inter-channel crosstalk noise. This problem impacts the Signal to Noise Ratio (SNR) of the optical signal, which leads to an increase in the Bit Error Rate (BER) at the receiver side. In this paper we first formulate the crosstalk noise and execution time models and then propose a Wavelength Allocation (WA) method for a ring-based WDM ONoC that searches for performance and energy trade-offs based on the application constraints. As a result, the most promising WA solutions are highlighted for a defined application mapping onto a 16-core WDM ONoC.

Download Paper (PDF; Only available from the DATE venue WiFi)

End of session

Lunch Break in Garden Foyer

Keynote Lecture session 11.0 in "Garden Foyer" 13:20 - 13:50

Lunch Break in the Garden Foyer

On all conference days (Tuesday to Thursday), a buffet lunch will be offered in the Garden Foyer, in front of the session rooms. Kindly note that this is restricted to conference delegates possessing a lunch voucher only. When entering the lunch break area, delegates will be asked to present the corresponding lunch voucher of the day. Once the lunch area is being left, re-entrance is not allowed for the respective lunch.
In this session ideas related to approximate computing and neural networks are presented, which can be applied in novel communication and multimedia systems.

10.6.1 EXPLOITING SPECIAL-PURPOSE FUNCTION APPROXIMATION FOR HARDWARE-EFFICIENT QR-DECOMPOSITION
Speaker: Jochen Rust, University of Bremen, DE
Authors: Jochen Rust¹ and Steffen Paul²
¹University of Bremen, DE; ²University of Bremen, DE
Abstract: Efficient signal processing takes a key role in application-specific circuit design. For instance, future mobile communication standards, e.g., high-performance industrial communication, require high data rates, low latency and/or high energy-efficiency. Hence, sophisticated algorithms and computing schemes must be explored to satisfy these challenging constraints. In this paper we leverage the paradigm of approximate computing to enable hardware-efficient QR-decomposition for channel precoding. For an efficient computation of the Givens-Rotation, bivariate, non-linear numeric functions are taken into account. An effective design method is introduced leading to highly adapted (special-purpose) functions. For evaluation, our work is tested with different configurations in a Tomlinson-Harashima precoding downlink environment. In addition, a corresponding HDL implementation is set up and logic and physical CMOS synthesis is performed. The comparison to actual references proves our work to be a powerful approach for future mobile communication systems.
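The Givens rotation at the core of the QR-decomposition described above can be sketched in exact floating-point arithmetic as follows. This is only a reference sketch, not the authors' approximate hardware design: the names `givens` and `qr_givens` are illustrative assumptions, and the paper replaces the exact computation of the rotation coefficients with special-purpose approximated functions.

```python
import math

def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r

def qr_givens(A):
    """QR-decompose a real m x n matrix (list of rows) with Givens rotations.

    Returns (Q, R) with Q orthogonal (m x m) and R upper triangular (m x n).
    """
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero the entries below the diagonal of column j, bottom-up.
        for i in range(m - 1, j, -1):
            c, s = givens(R[i - 1][j], R[i][j])
            # Rotate rows i-1 and i of R (R <- G R).
            for k in range(n):
                u, v = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * u + s * v
                R[i][k] = -s * u + c * v
            # Accumulate the inverse rotation in Q (Q <- Q G^T), so Q R stays equal to A.
            for k in range(m):
                u, v = Q[k][i - 1], Q[k][i]
                Q[k][i - 1] = c * u + s * v
                Q[k][i] = -s * u + c * v
    return Q, R
```

The coefficients `c` and `s` depend on both inputs through a square root and a division, which is the kind of bivariate, non-linear function the abstract refers to.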
Download Paper (PDF; Only available from the DATE venue WiFi)

10.6.2 EMBRACING APPROXIMATE COMPUTING FOR ENERGY-EFFICIENT MOTION ESTIMATION IN HIGH EFFICIENCY VIDEO CODING
Speaker: Muhammad Shafique, Vienna University of Technology (TU Wien), AT
Authors: Walaa El-Harouni¹, Semeen Rehman², Bharath Srinivas Prabakaran², Akash Kumar³, Rehan Hafiz⁴ and Muhammad Shafique⁵
¹Private Researcher, DE; ²Technische Universität Dresden, DE; ³Technische Universitaet Dresden, DE; ⁴TU, PK; ⁵Vienna University of Technology (TU Wien), AT
Abstract: Approximate Computing is an emerging paradigm for developing highly energy-efficient computing systems. It leverages the inherent resilience of applications to trade off quality with energy efficiency. In this paper, we present a novel approximate architecture for energy-efficient motion estimation (ME) in high efficiency video coding (HEVC). We synthesized our designs for both ASIC and FPGA design flows. ModelSim gate-level simulations are used for functional and timing verification. We comprehensively analyze the impact of heterogeneous approximation modes on the power/energy-quality tradeoffs for various video sequences. To facilitate reproducible results for comparisons and further research and development, the RTL and behavioral models of approximate SAD architectures and constituting approximate modules are made available at https://sourceforge.net/projects/pacclib/.
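As a rough illustration of the quality/energy trade-off in an approximate sum of absolute differences (SAD), the sketch below truncates the least significant bits of each pixel before accumulating. LSB truncation is an assumed stand-in chosen for simplicity, not the approximation modes of the paper, and the function name `sad` and the parameter `dropped_lsbs` are hypothetical.

```python
def sad(block_a, block_b, dropped_lsbs=0):
    """Sum of absolute differences between two equally sized pixel blocks.

    With dropped_lsbs > 0 the low bits of every pixel are cleared first,
    mimicking a narrower (cheaper) datapath at the cost of accuracy.
    """
    mask = ~((1 << dropped_lsbs) - 1)
    return sum(
        abs((a & mask) - (b & mask))
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )

current = [[10, 20], [30, 40]]
reference = [[12, 19], [33, 40]]
exact = sad(current, reference)            # full-precision SAD
approx = sad(current, reference, 2)        # SAD with the two LSBs dropped
```

In a motion-estimation search, the candidate block with the smallest SAD is selected, so a small error in individual SAD values matters only when it changes the ranking of candidates.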
Download Paper (PDF; Only available from the DATE venue WiFi)

10.6.3 HARDWARE ARCHITECTURE OF BIDIRECTIONAL LONG SHORT-TERM MEMORY NEURAL NETWORK FOR OPTICAL CHARACTER RECOGNITION
Speaker: Vladimir Rybalkin, University of Kaiserslautern, DE
Authors: Vladimir Rybalkin¹, Mohammad Reza Yousefi², Norbert Wehn³ and Didier Stricker³
¹University of Kaiserslautern, DE; ²Augmented Vision Department, German Research Center for Artificial Intelligence (DFKI), DE; ³German Research Center for Artificial Intelligence (DFKI), DE
Abstract: Optical Character Recognition is the conversion of printed or handwritten text images into machine-encoded text. It is a building block of many processes such as machine reading, text-to-speech conversion, and text mining. Bidirectional Long Short-Term Memory Neural Networks have shown a superior performance in character recognition with respect to other types of neural networks. In this paper, to the best of our knowledge, we propose the first hardware architecture of Bidirectional Long Short-Term Memory Neural Network with Connectionist Temporal Classification for Optical Character Recognition. Based on the new architecture, we present an FPGA hardware accelerator that achieves 459 times higher throughput than state-of-the-art. Visual recognition is a typical task on mobile platforms, which usually handle it in one of two ways: either the task runs locally on an embedded processor or it is offloaded to a cloud to be run on a high-performance machine. We show that computationally intensive visual recognition tasks benefit from being migrated to our dedicated hardware accelerator, which outperforms a high-performance CPU in terms of runtime while consuming less energy than low-power systems, with negligible loss of recognition accuracy.
Download Paper (PDF; Only available from the DATE venue WiFi)

10.7 Adaptive and Resilient Cyber-Physical Systems
The session contains four regular papers and four IP papers addressing different aspects of adaptivity and resilience for Cyber-Physical Systems. The topic of the first paper is distributed architectures for deep neural networks executing on a set of mobile nodes. The second paper considers scheduling of imprecise computation tasks on MPSoC systems taking the uncertainty of harvested energy into account. The final two papers both consider resilience of CPS. The first presents a scheme for preventing GPS-based hijacking of drones and the last considers how to prevent adversaries from learning what is printed using a 3D printer. The four IP papers consider control and scheduling co-design, contract-based design, medical CPS, and utility-driven data transmission strategies for CPS.

Time | Label | Presentation Title | Authors
--- | --- | --- | ---
11:00 | 10.7.1 | EFFICIENT DRONE HIJACKING DETECTION USING ONBOARD MOTION SENSORS | Zhiwei Feng, Northeastern University, China, CN; Hong Kong Polytechnic University, HK; Chongqing University, CN; McGill University, CA
11:30 | 10.7.2 | ENERGY-ADAPTIVE SCHEDULING OF IMPRECISE COMPUTATION TASKS FOR QOS OPTIMIZATION IN REAL-TIME MPSOC SYSTEMS | Tongquan Wei, East China Normal University, CN
12:00 | 10.7.3 | FIX THE LEAK! AN INFORMATION LEAKAGE AWARE SECURED CYBER-PHYSICAL MANUFACTURING SYSTEM | Mohammad Al Faruque, UCI, US
12:15 | 10.7.4 | MODNN: LOCAL DISTRIBUTED MOBILE COMPUTING SYSTEM FOR DEEP NEURAL NETWORK | Kent W. Nixon, University of Pittsburgh, US; Jiachen Mao, Xiang Chen, Kent W. Nixon, Christopher Krieger and Yiran Chen
<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>12:30</td>
<td>IPS-3, 813</td>
<td>ANOMALIES IN SCHEDULING CONTROL APPLICATIONS AND DESIGN COMPLEXITY</td>
<td>Amir Aminifar, Swiss Federal Institute of Technology in Lausanne, CH; Enrico Bini, EPFL, CH; University of Turin, IT</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>Today, many control applications in cyber-physical systems are implemented on shared platforms. Such resource sharing may lead to complex timing behaviors and, in turn, instability of control applications. This paper highlights a number of anomalies demonstrating complex timing behaviors caused as a result of resource sharing. Such anomalous scenarios, then, lead to a dramatic increase in design complexity, if not properly considered. Here, we demonstrate that these anomalies are, in fact, very improbable. Therefore, design methodologies for these systems should mainly be devised and tuned towards the majority of cases, as opposed to anomalies, but should also be able to handle such anomalous scenarios.</td>
</tr>
<tr>
<td>12:32</td>
<td>IPS-5</td>
<td>MODELING AND INTEGRATING PHYSICAL ENVIRONMENT ASSUMPTIONS IN MEDICAL CYBER-PHYSICAL SYSTEM DESIGN</td>
<td>Chunhui Guo, Illinois Institute of Technology, US; Tsinghua University, CN; University of Illinois at Urbana-Champaign, US</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Abstract</strong></td>
<td>Implicit physical environment assumptions made by safety critical cyber-physical systems, such as medical cyber-physical systems (M-CPS), can lead to catastrophes. Several recent U.S. Food and Drug Administration (FDA) medical device recalls are due to implicit physical environment assumptions. In this paper, we develop a mathematical assumption model and composition rules that allow M-CPS engineers to explicitly and precisely specify assumptions about the physical environment in which the designed M-CPS operates. Algorithms are developed to integrate the mathematical assumption model with system model so that the safety of the system can be not only validated by both medical and engineering professionals but also formally verified by existing formal verification tools. We use an FDA recalled medical ventilator scenario as a case study to show how the mathematical assumption model and its integration in M-CPS design may improve the safety of the ventilator and M-CPS in general.</td>
</tr>
<tr>
<td>12:33</td>
<td>IPS-6</td>
<td>A UTILITY-DRIVEN DATA TRANSMISSION OPTIMIZATION STRATEGY IN LARGE SCALE CYBER-PHYSICAL SYSTEMS</td>
<td>Bei Yu, The Chinese University of Hong Kong, HK; Soumi Chattopadhyay, Indian Statistical Institute, IN; The Chinese University of Hong Kong, HK</td>
</tr>
</tbody>
</table>

**10.8a Smart and Wearable Sensors for Health**

**Date:** Thursday 30 March 2017  
**Time:** 11:00 - 12:00  
**Location / Room:** Exhibition Theatre  
**Organiser:**  
Patrick Mayor, EPFL, CH  
**Moderator:**  
Martin Rajman, EPFL, CH  

The goal of this session is to present three concrete examples of innovative wearable devices: a contactless monitoring system using dedicated imaging to accurately measure heart and respiratory rates of neonates, wearable devices integrated in smart textiles for the long-term monitoring of obese patients, and a prototype of a next-generation, high-quality, mobile ultrasound imaging device.
11:00 10.8a.1 NEWBORNCARE
Speaker: Martin Wolf, USZ, CH

11:20 10.8a.2 OBESENSE
Speaker: Jean-Philippe Thiran, EPFL, CH

11:40 10.8a.3 ULTRASOUNDTOGO
Speaker: Federico Angiolini, EPFL, CH

12:00 End of session
12:30 Lunch Break in Garden Foyer


10.8b IoT Edge Devices

Date: Thursday 30 March 2017
Time: 12:00 - 12:30
Location / Room: Exhibition Theatre

12:00 10.8b.1 MENTOR'S CUSTOM/ANALOG SOLUTIONS FOR IOT EDGE DEVICES
Speaker: Nicolas Williams, Mentor, US

12:30 End of session
12:30 Lunch Break in Garden Foyer


UB10 Session 10

Date: Thursday 30 March 2017
Time: 12:00 - 14:30
Location / Room: Booth 1, Exhibition Area

UB10.1 A FRAMEWORK FOR VARIATION-AWARE ANALOG CIRCUITS SIZING
Presenter: Ons Lahiouel, Concordia University, CA
Authors: Mohamed H. Zaki and Sofiene Tahar, Concordia University, CA

Abstract
Today's analog design faces significant challenges due to circuit complexity and short time-to-market windows. The proposed demonstration presents new techniques for enhancing variation-aware circuits sizing. The sizing problem is encoded using nonlinear constraints. A new algorithm using Satisfiability Modulo Theory (SMT) solving techniques exhaustively explores the design space and computes a continuous set of feasible sizing solutions. Two methods for the computation of parametric yield are implemented. The first method combines the advantages of sparse regression and SMT solving techniques for reliable and accelerated yield estimation. The second approach employs a statistical classifier to reduce the number of simulations. An optimization process using a two-step exploration strategy is also integrated to find the feasible design point with the highest yield. Experimental results show that our approach locates a higher-quality design point in less run time.


UB10.2 TFA: TRANSPARENT CODE OFFLOADING ON FPGA
Presenter: Roberto Rigamonti, HEIG-VD/HES-SO, CH
Authors: Anthony Convers, Baptiste Delporte, Xavier Ruppen and Alberto Dassatti, HEIG-VD/HES-SO, CH

Abstract
Genomics, molecular dynamics, and machine learning are just the most recent examples of fields where FPGAs could provide the means to achieve interesting breakthroughs. However, HDL programming requires considerable multi-disciplinary skills, experience, large budgets, time, and a bit of wizardry. Given that most implementations are short-lived, the investment simply does not pay off. In this demo we propose a multi-vendor LLVM-based automated framework that can transparently - without the user or developer being aware of it - offload computing-intensive code fragments to FPGAs. The system relies on a performance monitor to detect computing-intensive code sections and, if they are suitable for offloading, extracts the Data Flow Graph and uses it to program an overlay pre-programmed on the FPGA, which then interacts with the Just-In-Time compiler executing the program. The overall process requires hundreds of microseconds, and can be easily reverted should the outcome be unsatisfactory.

Sani R. Nassif, Radyalis LLC, US

THE ENGINEERING TO MEDICINE METAMORPHOSIS

Abstract
We EDA engineers are justifiably proud of the tremendous success that integrated electronics has enjoyed over the last 50 years. After all, the world has been irrevocably changed by the pervasive connectivity and computing capability we have enabled. Today’s smart devices are just the beginning of an avalanche of “intelligence” that will be enabled by the internet of things and further change our lives for the better. But it can sometimes be difficult to explain to a layperson what part we have played in this narrative; somehow a 2% improvement in routing density or simulation accuracy sounds quite far from “the next iPhone”. As technology slows down, matures, and the industry consolidates, we are presented with opportunities for applying our talents for the analysis, modeling, optimization and solution of difficult large scale problems in adjacent fields. This talk is about one such opportunity in the area of radiation therapy, where Medical Physicists work hand-in-hand with Oncologists to provide life-saving treatments for Cancer. Making the transition from EDA to Medicine required some significant sacrifices and humility, but the end result is a commercial and scientific success and a far greater level of relevance to people’s lives.
### Design Challenges for Wearable EMG Applications

**Speaker:**
Elisabetta Farella, Fondazione Bruno Kessler - ICT Center, IT

**Authors:**
Bojan Milosevic¹, Simone Benatti² and Elisabetta Farella¹

¹Fondazione Bruno Kessler (FBK), IT; ²Università di Bologna, IT

**Abstract:**
Wearable technologies are changing the way we deal with health and fitness in our daily life. Nevertheless, while MEMS-enabled inertial sensors have conquered the consumer market, physiological monitoring has still to face barriers due to the complexity and costs of physical interfaces (e.g. electrodes), the degree of intuitiveness of the interaction and the processing required to reach satisfying accuracy. These limitations are mitigated by the embedded systems’ growing integration of interfacing capabilities and efficient computing power. In this paper, we describe the main applications and the related technologies for the acquisition and processing of myoelectric (EMG) signals. Starting from well established active sensors and bench-top setups, we introduce a recent design based on the combination of an integrated Analog Front End (AFE) and embedded processing. This solution provides high quality signal acquisition and on-board digital processing capabilities with a contained power consumption. The system was tested within the prosthesis control application scenario, one of the most stringent EMG applications, achieving a 90% gesture recognition accuracy with real time on-board processing at a power consumption of 30mW. Such promising results highlight the current trend in shifting EMG applications from dedicated analog solutions towards integrated digital devices, favouring the development of advanced, modular and low-power wearable solutions.

Download Paper (PDF; Only available from the DATE venue WiFi)
End of session
Coffee Break in Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Tuesday, March 28, 2017
- Coffee Break 10:30 - 11:30
- Coffee Break 16:00 - 17:00

Wednesday, March 29, 2017
- Coffee Break 10:00 - 11:00
- Coffee Break 16:00 - 17:00

Thursday, March 30, 2017
- Coffee Break 10:00 - 11:00
- Coffee Break 15:30 - 16:00

11.2 Emerging Technologies for Future Memory Design

**Date:** Thursday 30 March 2017  
**Time:** 14:00 - 15:30  
**Location / Room:** 4BC

**Chair:** Weisheng Zhao, Beihang University, CN  
**Co-Chair:** Jean-Michel Portal, Aix-Marseille Université, FR

Memory design based on emerging technologies is critical for future VLSI design targeting low power and high performance. This session covers novel design methods and evaluation tools for emerging technologies (e.g., STT-MRAM, Racetrack memory, Phase Change Memory and Ferroelectric memory), including variation-aware design, novel architecture implementations and reliability concerns.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>15:30</td>
<td>End of session</td>
<td>Coffee Break in Exhibition Area</td>
<td></td>
</tr>
<tr>
<td>16:00</td>
<td>Coffee Break</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

11.2.1 (Best Paper Award Candidate)

**HYBRID VC-MTJ/CMOS NON-VOLATILE STOCHASTIC LOGIC FOR EFFICIENT COMPUTING**

**Speaker:** Shaodi Wang, University of California, Los Angeles, US  
**Authors:** Shaodi Wang¹, Saptadeep Pal¹, Tianmu Li², Andrew Pan², Cecile Grezes², Pedram Khalili-Amiri², Kang L. Wang² and Puneet Gupta²

¹University of California, Los Angeles, US; ²UCLA, US

**Abstract**

In this paper, we propose a non-volatile stochastic computing (SC) scheme using voltage-controlled magnetic tunnel junctions (VC-MTJs) and negative differential resistance (NDR). The proposed design includes a VC-MTJ based true stochastic bit stream generator and a VC-MTJ and NDR based stochastic adder, multiplier and register, which are experimentally demonstrated using 60nm VC-MTJ and CMOS NDR connected on die. These components are then used to realize an FIR filter and AdaBoost (a machine-learning algorithm). A 3X - 37X energy advantage is shown for the proposed SC compared with CMOS binary arithmetic ASIC and SC designs.
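For readers unfamiliar with stochastic computing, the following sketch shows the standard unipolar encoding in software: a value in [0, 1] is the fraction of ones in a bit stream, multiplication is a bitwise AND of two independent streams, and scaled addition is a multiplexer driven by a select stream. It assumes a seeded pseudo-random generator as a stand-in for the VC-MTJ true random bit stream generator, and all function names are illustrative rather than taken from the paper.

```python
import random

def to_stream(p, n, rng):
    """Encode probability p as a unipolar stochastic bit stream of length n."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def value(stream):
    """Decode a stream: the fraction of ones estimates the encoded value."""
    return sum(stream) / len(stream)

def sc_multiply(x, y):
    """Multiply two independent streams with a bitwise AND."""
    return [a & b for a, b in zip(x, y)]

def sc_scaled_add(x, y, select):
    """Multiplexer-based adder: the result encodes s*x + (1 - s)*y,
    where s is the value of the select stream (0.5 gives (x + y) / 2)."""
    return [a if s else b for a, b, s in zip(x, y, select)]

rng = random.Random(0)          # software stand-in for a true random source
n = 20000
x = to_stream(0.5, n, rng)
y = to_stream(0.4, n, rng)
half = to_stream(0.5, n, rng)
product = value(sc_multiply(x, y))          # close to 0.5 * 0.4
mean = value(sc_scaled_add(x, y, half))     # close to (0.5 + 0.4) / 2
```

Accuracy improves with stream length, so precision is traded against latency and energy per operation.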

Download Paper (PDF; Only available from the DATE venue WiFi)

11.2.2

**DESIGN AND BENCHMARKING OF FERROELECTRIC FET BASED TCAM**

**Speaker:** Xunzhao Yin, University of Notre Dame, US  
**Authors:** Xunzhao Yin, Michael Niemier and X. Sharon Hu, University of Notre Dame, US

**Abstract**

We consider how emerging transistor technologies, specifically ferroelectric field effect transistors (or FeFETs), can realize compact and energy efficient ternary content addressable memories (TCAMs). As Moore's Law-based performance scaling trends slow, and many computational tasks of interest are now more data-centric than compute-centric, researchers are looking to improve performance/save energy by integrating efficient and compact logic-processing elements into various levels of the memory hierarchy. Potential benefits include reduced I/O traffic, energy/delay from data transfers, etc. A TCAM is an example of a logic-in-memory element that is ubiquitous in routers, caches, databases, and even neural networks. Not surprisingly, researchers continue to study how emerging technologies could lead to improved TCAMs. Recent work has considered how non-volatile (NV) memory technologies (e.g., resistive random access memory (ReRAM) or magnetic tunnel junctions (MTJs)) could best be used to construct low energy, NV TCAMs. However, acceptable Ron-Roff ratios and the two terminal nature of these devices introduce energy and area overheads. Due to hysteresis in a device’s I-V curve, an FeFET-based NV TCAM offers low area overhead, as well as search energies and search speeds that are superior to other TCAM designs (i.e., those based on MTJ, ReRAM and CMOS) in array- and architectural-level evaluations.
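The search behavior of a TCAM can be modeled functionally as below: each stored entry holds a value and a care mask, a key matches when it agrees with the value on every cared-for bit, and the lowest-index match wins. This is a behavioral sketch only; the pattern syntax with `x` for don't-care and the names `parse_pattern` and `tcam_search` are assumptions for illustration. Hardware compares all rows in parallel, whereas this model scans them in order.

```python
def parse_pattern(pattern):
    """Convert a ternary string such as '10x1' into (value, care_mask).

    'x' marks a don't-care bit, which is excluded from the comparison.
    """
    value = care = 0
    for ch in pattern:
        value = (value << 1) | (1 if ch == "1" else 0)
        care = (care << 1) | (0 if ch == "x" else 1)
    return value, care

def tcam_search(entries, key):
    """Return the index of the first entry matching key, or None."""
    for index, (value, care) in enumerate(entries):
        # Bits that differ are set in key ^ value; only cared-for bits count.
        if ((key ^ value) & care) == 0:
            return index
    return None

table = [parse_pattern(p) for p in ["1010", "10xx", "xxxx"]]
```

With this table, the key `0b1001` misses the exact entry but hits the `10xx` prefix entry, which is how longest-prefix routing tables are typically laid out.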

Download Paper (PDF; Only available from the DATE venue WiFi)
15:00  11.2.3  LEVERAGING ACCESS PORT POSITIONS TO ACCELERATE PAGE TABLE WALK IN DWM MAIN MEMORY

**Speaker:**
Chengmo Yang, University of Delaware, US

**Authors:**
Hoda Aghaei Khouzani¹, Pouya Fotouhi², Chengmo Yang¹ and Guang R. Gao²
¹University of Delaware, US; ²Department of Electrical and Computer Engineering, University of Delaware, US

**Abstract**
Domain Wall Memory (DWM) with ultra-high density and comparable read/write latency to SRAM/DRAM is an attractive replacement for CMOS-based devices. Unlike SRAM/DRAM, DWM has non-uniform data access latency that is proportional to the number of shift operations. While previous works have demonstrated the feasibility of using DWM as main memory and have proposed different ways to alleviate the impact of shift operations, none of them have addressed the performance-critical metadata accesses, in particular page table accesses. To bridge this gap, this paper aims at accelerating page table walk in DWM main memory from two innovative aspects. First of all, we propose a new page table layout and leverage the positions of access ports in DWM to differentiate the state of page table entries. In addition, we propose a technique to pre-align the access ports to the positions to be accessed in the near future, thus hiding shift latency to the maximum extent. Since both address translation and context switching are affected by page table access latency, the proposed technique can effectively improve system performance and user experience.
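The shift-dependent latency and the benefit of pre-aligning access ports can be illustrated with a toy model. It assumes a single track with one access port and one shift operation per domain moved; the class `DwmTrack` and its methods are hypothetical, and the model ignores the page table layout proposed in the paper.

```python
class DwmTrack:
    """Toy model of one domain-wall-memory track with a single access port."""

    def __init__(self, length, aligned=0):
        self.length = length
        self.aligned = aligned  # index of the domain currently under the port

    def _check(self, index):
        if not 0 <= index < self.length:
            raise IndexError("domain index out of range")

    def access(self, index):
        """Shift the requested domain under the port.

        Returns the number of shift operations, used here as a latency proxy.
        """
        self._check(index)
        shifts = abs(index - self.aligned)
        self.aligned = index
        return shifts

    def pre_align(self, index):
        """Shift ahead of time; in this model the cost is off the critical path."""
        self._check(index)
        self.aligned = index

track = DwmTrack(length=64)
first = track.access(10)     # pays for every domain shifted past the port
second = track.access(12)    # a nearby domain is cheap
track.pre_align(40)          # predicted next access, shifted during idle time
third = track.access(40)     # no shifts left on the critical path
```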

Download Paper (PDF; Only available from the DATE venue WiFi)

15:15  11.2.4  VAET-ST: A VARIATION AWARE ESTIMATOR TOOL FOR STT-MRAM BASED MEMORIES

**Speaker:**
Sarath Mohanachandran Nair, KIT, Germany, DE

**Authors:**
Sarath Mohanachandran Nair¹, Rajendra Bishnoi², Mohammad Saber Golanbari¹, Fabian Oboril¹ and Mehdi Tahoori¹
¹Karlsruhe Institute of Technology, DE; ²Karlsruhe Institute of Technology, DE

**Abstract**
Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) is a promising candidate to replace CMOS based on-chip memories due to its advantages such as non-volatility, high density and scalability. However, its stochastic switching and higher sensitivity to process variation compared to CMOS memories can significantly affect its performance, energy and reliability. Although a few works exist which analyze the impact of process variation at the bit-cell level, such analysis at the system level is missing. We have bridged this gap in our work. Specifically, we quantify the effect of stochasticity and process variations from the cell-level to the overall memory system and perform a variation-aware memory configuration optimization for energy or performance while meeting reliability constraints. Our system-level variation-aware framework has been built on top of the well-known NVSim engine. The results show that our framework can provide more realistic margins and the optimized variation-aware memory configuration could be significantly different from the conventional framework.

Download Paper (PDF; Only available from the DATE venue WiFi)

15:30  IPS-7, 52  PROTECT NON-VOLATILE MEMORY FROM WEAR-OUT ATTACK BASED ON TIMING DIFFERENCE OF ROW BUFFER HIT/MISS

**Speaker:**
Haiyu Mao, Tsinghua University, CN

**Authors:**
Haiyu Mao¹, Xian Zhang², Guangyu Sun² and Jiwu Shu¹
¹Tsinghua University, CN; ²Peking University, CN

**Abstract**
Non-volatile Memories (NVMs), such as PCM and ReRAM, have been widely proposed for future main memory design because of their low standby power, high storage density, and fast access speed. However, these NVMs suffer from the write endurance problem. In order to prevent a malicious program from wearing out NVMs deliberately, researchers have proposed various wear-leveling methods, which remap logical addresses to physical addresses randomly and dynamically. However, we discover that side channel leakage based on NVM row buffer hit information can reveal details of address remappings. Consequently, it can be leveraged to side-step the wear-leveling. Our simulation shows that the proposed attack method in this paper can wear out an NVM within 137 seconds, even with the protection of state-of-the-art wear-leveling schemes. To counteract this attack, we further introduce an effective countermeasure named Intra-Row Swap (IRS) to hide the wear-leveling details. The basic idea is to enable an additional intra-row block swap when a new logical address is remapped to the memory row. Experiments demonstrate that IRS can secure NVMs with negligible timing/energy overhead, compared with previous works.

Download Paper (PDF; Only available from the DATE venue WiFi)

15:32  IPS-8, 622  EFFECTS OF CELL SHAPES ON THE ROUTABILITY OF DIGITAL MICROFLUIDIC BIOCHIPS

**Speaker:**
Oliver Keszöcze, University of Bremen, DE

**Authors:**
Kevin Leonard Schneider¹, Oliver Keszöcze¹, Jannis Stoppe¹ and Rolf Drechsler²
¹University of Bremen, DE; ²University of Bremen/DFKI GmbH, DE

**Abstract**
Digital microfluidic biochips (DMFBs) are an emerging technology promising a high degree of automation in laboratory procedures by means of manipulating small discretized amounts of fluids. A crucial part in conducting experiments on biochips is the routing of discretized droplets. While doing so, droplets must not enter each other's interference region to avoid unintended mixing. This leads to cells in the proximity of the droplet being impassable for others. For different cell shapes, the effect of these temporary blockages varies as the adjacency of cells changes with their shapes. Yet, no evaluation with respect to routability in relation to cell shapes has been conducted so far. This paper analyses and compares various tessellations for the field of cells. Routing benchmarks are mapped to these and the results are compared in order to determine if and how cell shapes affect the performance of DMFBs, showing that certain cell shapes are superior to others.

Download Paper (PDF; Only available from the DATE venue WiFi)
**A MECHANISM FOR ENERGY-EFFICIENT REUSE OF DECODING AND SCHEDULING OF X86 INSTRUCTION STREAMS**

**Speaker:**
Antonio Carlos S. Beck, Universidade Federal do Rio Grande do Sul, BR

**Authors:**
Marcelo Brandalero and Antonio Carlos Schneider Beck, Universidade Federal do Rio Grande do Sul, BR

**Abstract**
Current superscalar x86 processors decompose each CISC instruction (variable-length and with multiple addressing modes) into multiple RISC-like µops at runtime so they can be pipelined and scheduled for concurrent execution. This challenging and power-hungry process, however, is usually repeated several times on the same instruction sequence, inefficiently producing the very same decoded and scheduled µops. Therefore, we propose a transparent mechanism to save the decoding and scheduling transformation for later reuse, so that the next time the same instruction sequence is found it can automatically bypass the costly pipeline stages involved. We use a coarse-grained reconfigurable array as a means to save this transformation, since its structure enables the recovery.

**Download Paper (PDF; Only available from the DATE venue WiFi)**

---

**UNDERSTANDING THE IMPACT OF PRECISION QUANTIZATION ON THE ACCURACY AND ENERGY OF NEURAL NETWORKS**

**Speaker:**
Sharief Reda, Brown University, US

**Authors:**
Soheil Hashemi, Nicholas Anthony, Hokchhay Tann, Iris Bahar and Sharief Reda, Brown University, US

**Abstract**
Deep neural networks are gaining in popularity as they are used to generate state-of-the-art results for a variety of computer vision and machine learning applications. At the same time, these networks have grown in depth and complexity in order to solve harder problems. Given the limitations in power budgets dedicated to these networks, the importance of low-power, low-memory solutions has been stressed in recent years. While a large number of dedicated hardware using different precisions has recently been proposed, there exists no comprehensive study of different bit precisions and arithmetic in both inputs and network parameters. In this work, we address this issue and perform a study of different bit-precisions in neural networks (from floating-point to fixed-point, powers of two, and binary). In our evaluation, we consider and analyze the effect of precision scaling on both network accuracy and hardware metrics including memory footprint, power and energy consumption, and design area. We also investigate training-time methodologies to compensate for the reduction in accuracy due to limited bit precision and demonstrate that in most cases, precision scaling can deliver significant benefits in design metrics at the cost of very modest decreases in network accuracy. In addition, we propose that a small portion of the benefits achieved when using lower precisions can be forfeited to increase the network size and therefore the accuracy. We evaluate our experiments, using three well-recognized networks and datasets to show its generality. We investigate the trade-offs and highlight the benefits of using lower precisions in terms of energy and memory footprint.

**Download Paper (PDF; Only available from the DATE venue WiFi)**
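
The precision levels compared in the abstract above (fixed-point, powers of two, binary) can be illustrated with a minimal sketch. The function names, bit widths and sample weights below are assumptions chosen for illustration; they are not taken from the paper, and the paper's own quantization and training flow may differ.

```python
import math

def quantize_fixed_point(x, total_bits=8, frac_bits=4):
    """Round x to a signed fixed-point value and saturate on overflow."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    q = max(lowest, min(highest, round(x * scale)))
    return q / scale

def quantize_power_of_two(x, min_exp=-8, max_exp=0):
    """Round |x| to the nearest power of two in the log domain, keeping the sign."""
    if x == 0:
        return 0.0
    exp = round(math.log2(abs(x)))
    exp = max(min_exp, min(max_exp, exp))
    return math.copysign(2.0 ** exp, x)

def binarize(x):
    """Keep only the sign of x."""
    return 1.0 if x >= 0 else -1.0

# Hypothetical weights, used only to show how each scheme distorts values.
weights = [0.30, -0.72, 0.05, 1.90]
for name, fn in [("fixed-point", quantize_fixed_point),
                 ("power-of-two", quantize_power_of_two),
                 ("binary", binarize)]:
    print(name, [fn(w) for w in weights])
```

Lower precisions need fewer bits per value (and power-of-two weights turn multiplications into shifts), at the price of a larger rounding error per weight, which is the trade-off the abstract evaluates.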
### 11.4 Advances in Timing and Layout

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>15:15</td>
<td>11.3.4</td>
<td>BIG VS LITTLE CORE FOR ENERGY-EFFICIENT HADOOP COMPUTING</td>
<td>Speaker: Houman Homayoun, George Mason University, US</td>
</tr>
</tbody>
</table>
|            |       |                                                          | Authors:  
|            |       |                                                          | Maria Malik, Katayoun Neshatpour, Tinoosh Mohsenin 1, Avesta Sasan 1, and Houman Homayoun 1  
|            |       |                                                          | George Mason University, US; University of Maryland Baltimore County, US |
|            |       |                                                          | **Abstract**  
|            |       |                                                          | The rapid growth of data makes it challenging to process data efficiently using current high-performance server architectures such as big Xeon cores. Furthermore, physical design constraints, such as power and density, have become the dominant limiting factors for scaling out servers. Heterogeneous architectures that combine big Xeon cores with little Atom cores have emerged as a promising solution to enhance energy-efficiency by allowing each application to run on an architecture that matches resource needs more closely than a one-size-fits-all architecture. Therefore, the question of whether to map the application to big Xeon or little Atom in heterogeneous server architecture becomes important. In this paper, we characterize Hadoop-based applications and their corresponding MapReduce tasks on big Xeon and little Atom-based server architectures to understand how the choice of big vs little cores is affected by various parameters at application, system and architecture levels and the interplay among these parameters. Furthermore, we have evaluated the operational and the capital cost to understand how performance, power and area constraints for big data analytics affect the choice of big vs little core server as a more cost and energy efficient architecture. |
|            |       |                                                          | Download Paper (PDF; Only available from the DATE venue WiFi)             |
| 15:30      | IPS-9 | LESS: BIG DATA SKETCHING AND ENCRYPTION ON LOW POWER PLATFORM | Speaker: Amey Kulkarni, University of Maryland Baltimore County, US       |
|            | 763   |                                                          | Authors:  
|            |       |                                                          | Amey Kulkarni 1, Colin Shea 2, Houman Homayoun 3 and Tinoosh Mohsenin 2  
|            |       |                                                          | 1University of Maryland, Baltimore County, US; 2University of Maryland Baltimore County, US; 3George Mason University, US |
|            |       |                                                          | **Abstract**  
|            |       |                                                          | Ever-growing IoT demands big data processing and cognitive computing on mobile and battery operated devices. However, big data processing on low power embedded cores is challenging due to their limited communication bandwidth and on-chip storage. Additionally, IoT and cloud-based computing demand a low overhead security kernel to avoid data breaches. In this paper, we propose a Lightweight Encryption using Scalable Sketching (LESS) framework for big data sketching and encryption using One-Time Random Linear Projections (ORTLP). The ORTLP encoded matrix makes Known Plaintext Attacks (KPA) ineffective, and attackers cannot gain significant information from a plaintext-ciphertext pair. The LESS framework can reduce data up to 67% with 3.81×dB signal-to-reconstruction error rate (SRER). This framework has two important kernels, “sketching” and “sketch-reconstruction”; the latter is computationally intensive and costly. We propose to accelerate the sketch reconstruction using Orthogonal Matching Pursuit (OMP) on a domain specific many-core hardware named Power Efficient Nano Cluster (PENC) designed by the authors. Detailed performance and power analysis suggests that the PENC platform has 15x and 200x less energy consumption and 8x and 177x faster reconstruction time as compared to a low power ARM CPU and K1 GPU, respectively. To demonstrate the efficiency of the LESS framework, we integrate it with the Hadoop MapReduce platform for an objects and scenes identification application. The full hardware integration consists of tiny ARM cores which perform task scheduling and the objects identification application, while PENC acts as an accelerator for sketch reconstruction. The full hardware integration results show that the LESS framework achieves 46% reduction in data transfers with very low execution overhead of 0.11% and negligible energy overhead of 0.001% when tested for 2.6GB streaming input data. The heterogeneous LESS framework requires 2x less transfer time and achieves 2.25x higher throughput per watt compared to the MapReduce platform. |
|            |       |                                                          | Download Paper (PDF; Only available from the DATE venue WiFi)             |
| 15:31      | IPS-10| TRUNCAPP: A TRUNCATION-BASED APPROXIMATE DIVIDER FOR ENERGY EFFICIENT DSP APPLICATIONS | Speaker: Shaghayegh Vahdat, University of Tehran, IR                    |
|            | 656   |                                                          | Authors:  
|            |       |                                                          | Shaghayegh Vahdat 1, Mehdi Kamal 1, Ali Afzali-Kusha 1, Zainalabedin Navabi 1 and Massoud Pedram 2  
|            |       |                                                          | 1University of Tehran, IR; 2University of Southern California, US |
|            |       |                                                          | **Abstract**  
|            |       |                                                          | In this paper, we present a high speed yet energy efficient approximate divider where the division operation is performed by multiplying the dividend by the inverse of the divisor. In this structure, the truncated value of the dividend is multiplied exactly (approximately) by the approximate inverse value of the divisor. To assess the efficacy of the proposed divider, its design parameters are extracted and compared to those of a number of prior art dividers in a 45nm CMOS technology. Results reveal that this structure provides 66% and 52% improvements in the area and energy consumption, respectively, compared to the most advanced prior art approximate divider. In addition, delay and energy consumption of the division operation are reduced by about 94.4% and 99.93%, respectively, compared to those of an exact SRT radix-4 divider. Finally, the efficacy of the proposed divider in an image processing application is studied. |
|            |       |                                                          | Download Paper (PDF; Only available from the DATE venue WiFi)             |
| 15:30      |       | End of session                                           |                                                          |
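
The idea behind the TruncApp abstract above, dividing by multiplying a truncated dividend with an approximate inverse of the divisor, can be sketched in software as follows. This is an illustrative model only: the function names, the number of kept bits and the fixed-point width are assumptions, and the integer division used to form the reciprocal merely stands in for whatever inverse approximation the hardware uses.

```python
def truncate(value, keep_bits):
    """Keep the keep_bits most significant bits, starting at the leading one."""
    if value == 0:
        return 0
    shift = max(value.bit_length() - keep_bits, 0)
    return (value >> shift) << shift

def approx_divide(dividend, divisor, keep_bits=4, frac_bits=16):
    """Approximate dividend // divisor for non-negative integers."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    # Assumption: the reciprocal of the truncated divisor is held in fixed
    # point; a real design would not use an exact divider to obtain it.
    inverse = (1 << frac_bits) // truncate(divisor, keep_bits)
    product = truncate(dividend, keep_bits) * inverse
    return product >> frac_bits

print(approx_divide(1000, 7), 1000 // 7)  # approximate vs exact quotient
```

Keeping fewer bits shrinks the multiplier that would be needed in hardware but increases the quotient error, which is the accuracy/energy knob the abstract refers to.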

**Coffee Break in Exhibition Area**

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

**Tuesday, March 28, 2017**
- Coffee Break 10:30 - 11:30
- Coffee Break 16:00 - 17:00

**Wednesday, March 29, 2017**
- Coffee Break 10:00 - 11:00
- Coffee Break 16:00 - 17:00

**Thursday, March 30, 2017**
- Coffee Break 10:00 - 11:00
- Coffee Break 15:30 - 16:00

This session focuses on issues related to timing and layout in the presence of manufacturing variability and photolithographic limitations. The first paper reduces pessimism in timing analysis by estimating path sensitization while accounting for delay variations. The second paper enables patterning with reduced wirelength and fewer overlay violations through placement refinement. The third paper improves manufacturability with an optimization algorithm for cut locations in the line-end cut process. The last paper discusses clock tree synthesis to reduce delay sensitivity mismatch with gate delay circuitry.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
<th>Abstract</th>
</tr>
</thead>
<tbody>
<tr>
<td>14:00</td>
<td>11.4.1</td>
<td>QUANTIFYING ERROR: EXTENDING STATISTICAL TIMING ANALYSIS WITH PROBABILISTIC TRANSITIONS</td>
<td>Kevin Murray, University of Toronto, CA; Andrea Suardi, Imperial College, GB; George Constantinides</td>
<td>Timing analysis is a cornerstone of the digital design process. Statistical Static Timing Analysis was introduced to reduce pessimism by modelling device delay variations. However, it ignores circuit logic, which may cause some timing paths to never or only rarely be sensitized. We introduce a general timing analysis approach and tool to calculate the probability that individual timing paths are sensitized, enabling the calculation of bounding delay distributions over all input combinations. We show the connection to the well-known #SAT problem and present two approaches to improve scalability, achieving average results 46 to 32% less pessimistic than Static Timing Analysis while running 14.6 to 44.0 times faster than Monte-Carlo timing simulation.</td>
</tr>
<tr>
<td>14:30</td>
<td>11.4.2</td>
<td>ON REFINING STANDARD CELL PLACEMENT FOR SELF-ALIGNED DOUBLE PATTERNING</td>
<td>Speaker: Ting-Chi Wang, National Tsing Hua University, TW; Authors: Ye-Hong Chen, Sheng-He Wang and Ting-Chi Wang, National Tsing Hua University, TW</td>
<td>In this paper, we study the problem of refining a standard cell placement for self-aligned double patterning (SADP), which asks to simultaneously refine a detailed placement and find a valid SADP layout decomposition such that both overlay violation and wirelength are as small as possible. We present an algorithm that adopts the technique of white space insertion for an SADP-aware single-row cell placement problem. Based on the single-row algorithm, we then describe an approach to the addressed placement refinement problem. Finally, we report encouraging experimental results to support the efficacy of our approach.</td>
</tr>
<tr>
<td>14:40</td>
<td>11.4.3</td>
<td>CUT MASK OPTIMIZATION FOR MULTI-PATTERNING DIRECTED SELF-ASSEMBLY LITHOGRAPHY</td>
<td>Wachirawit Ponghiran, School of Electrical Engineering, KAIST, KR; Seongbo Shim, KAIST, KR; Youngsoo Shin, KAIST, KR</td>
<td>The line-end cut process has been used to create very fine metal wires in sub-14nm technology. Cut patterns split regular line patterns into a number of wire segments, with some segments being used as atomic routing wires. In sub-7nm technology, cuts are smaller than the optical resolution limit, and directed self-assembly lithography with multiple patterning (MP-DSAL) is considered as a patterning solution. We address the cut mask optimization problem for MP-DSAL, in which cut locations are determined in such a way that cuts are grouped into manufacturable clusters and assigned to one of the masks without MP coloring conflicts; minimizing wire extensions is also pursued in the process. Only a restricted version of this problem has been addressed before, while we do not assume any such restrictions. The problem is first formulated as an ILP, and a fast heuristic algorithm is also proposed for application to larger circuits. Experimental results indicate that the ILP can remove all coloring conflicts, and reduce total wire extensions by 93% on average compared to those obtained by the restricted approach. The heuristic achieves a similar result with less than 1% of coloring conflicts and 91% reduction in total wire extensions.</td>
</tr>
<tr>
<td>15:15</td>
<td>11.4.4</td>
<td>CLOCK DATA COMPENSATION AWARE CLOCK TREE SYNTHESIS IN DIGITAL CIRCUITS WITH ADAPTIVE CLOCK GENERATION</td>
<td>Speaker: Saibal Mukhopadhyay, Georgia Institute of Technology, US; Authors: Taesik Na and Saibal Mukhopadhyay, Georgia Institute of Technology, US</td>
<td>Adaptive clock generation to track critical path delay enables lowering supply voltage with improved timing slack under supply noise. This paper presents how to synthesize a clock tree in adaptive clocking to fully exploit the clock data compensation (CDC) effect in digital circuits. The paper first provides an analytical proof of the ideal CDC effect for ring oscillator based clock generation. Second, the paper analyzes the non-ideal CDC effect in a gate dominated critical path and wire dominated clock tree design. The paper shows that the delay sensitivity mismatch between clock tree and critical path can degrade the CDC effect by analyzing timing slack under power supply noise (PSN). Finally, the paper proposes a simple but efficient clock tree synthesis (CTS) technique to maximize timing slack under PSN in digital circuits with adaptive clock generation.</td>
</tr>
<tr>
<td>15:30</td>
<td>IPS-11</td>
<td>TIMING-AWARE WIRE WIDTH OPTIMIZATION FOR SADP PROCESS</td>
<td>Youngsoo Song, KAIST, KR; Sangmim Kim and Youngsoo Shin, School of Electrical Engineering, KAIST, KR</td>
<td>With the scaling of the minimum feature size, the RC delay of interconnect is becoming relatively more critical in next-node technology. SADP is one of the popular processes used in sub-7nm technology. For the SADP process, we can increase wire width using patterns formed by a block mask, which can reduce the wire resistance of critical nets. We determine the direction and length of each wire widening, so that the resulting layout is conflict-free. We cast this as a maximum weight independent set problem and solve it by formulating an ILP. For various test circuits, the wire resistance of critical nets was reduced on average by 18.5%, which led to a 9.9% reduction in clock period. The wire width optimization in the SADP process can give an insight into timing optimization through the enhancement of the fabrication process.</td>
</tr>
</tbody>
</table>
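
The path sensitization probability discussed in paper 11.4.1 above can be illustrated by brute-force counting on a toy circuit. The circuit, the function names and the uniform-input assumption are hypothetical; exhaustive enumeration is only a stand-in for the scalable #SAT-based approaches the abstract mentions.

```python
from itertools import product

def circuit(a, b, c):
    """Toy combinational circuit: y = (a AND b) OR c."""
    return (a and b) or c

def sensitization_probability(fn, n_inputs, target):
    """Fraction of side-input assignments for which toggling input
    `target` toggles the output, assuming uniformly random side inputs."""
    sensitized = 0
    total = 0
    for side in product([False, True], repeat=n_inputs - 1):
        low = list(side)
        low.insert(target, False)
        high = list(side)
        high.insert(target, True)
        total += 1
        if fn(*low) != fn(*high):
            sensitized += 1
    return sensitized / total

for index, name in enumerate("abc"):
    print(name, sensitization_probability(circuit, 3, index))
```

A transition on `a` only reaches the output when `b` is 1 and `c` is 0, so a timing analysis that treats the path through `a` as always active is pessimistic for this circuit.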

Download Paper (PDF; Only available from the DATE venue WiFi)
<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>11:50</td>
<td></td>
<td>End of session</td>
<td></td>
</tr>
</tbody>
</table>

### 11.5 Smart Energy and Automotive Systems

**Date:** Thursday 30 March 2017  
**Time:** 14:00 - 15:30  
**Location / Room:** 3C  
**Chair:**  
Geoff Merrett, University of Southampton, GB  
**Co-Chair:**  
Michele Magno, ETHZ, CH

This session presents the state of the art in efficient automotive software and smart battery systems, and the latest strides toward energy-neutral wireless communication systems.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
</table>
14:00 (Best Paper Award Candidate) ON REDUCING BUSY WAITING IN AUTOSAR VIA TASK-RELEASE-DELTA-BASED RUNNABLE REORDERING
Speaker: Robert Höttger, Dortmund University of Applied Sciences and Arts, DE
Authors: Robert Höttger¹, Olaf Spinczyk² and Burkhard Igel¹
¹FH-Dortmund, DE; ²TU-Dortmund, DE
Abstract
The increasing amount of innovative software technologies in the automotive domain comes with challenges regarding inevitable distributed multi-core and many-core methodologies. Approaches for general purpose solutions have been studied over decades but do not completely meet the specific constraints (e.g. timing, safety, reliability, affinity, etc.) for AUTOSAR compliant applications. AUTOSAR utilizes a spinlock mechanism in combination with the priority ceiling protocol in order to provide mutually exclusive access to shared resources. The essential disadvantages of spinlocks are unpredictable task response times on the one hand and wasted computation time caused by busy waiting periods on the other hand. In this paper, we propose a concept of task-release-delta-based runnable reordering for the purpose of sequentializing parallel accesses to shared resources, resulting in reduced task response times, improved timing predictability, and increased parallel efficiency. To achieve this, runnables, which represent the smallest executable program parts in AUTOSAR, are reordered based on precedence constraints. Our experiments on industrial use cases show that task response times can be reduced by up to 18.2%.
Download Paper (PDF; Only available from the DATE venue WiFi)

14:30 POWER NEUTRAL PERFORMANCE SCALING FOR ENERGY HARVESTING MP-SOCS
Speaker: Benjamin Fletcher, University of Southampton, GB
Authors: Benjamin Fletcher, Domenico Balsamo and Geoff Merrett, University of Southampton, GB
Abstract
Using energy 'harvested' from the environment to power autonomous embedded systems is an attractive ideal, alleviating the burden of periodic battery replacement. However, such energy sources are typically low-current and transient, with high temporal and spatial variability. To overcome this, large energy buffers such as supercapacitors or batteries are typically incorporated to achieve energy neutral operation, where the energy consumed over a certain period of time is equal to the energy harvested. Large energy buffers, however, pose environmental issues in addition to increasing the size and cost of systems. In this paper we propose a novel power neutral performance scaling approach for multiprocessor system-on-chips (MP-SoC) powered by energy harvesting. Under power neutral operation, the system's performance is dynamically scaled through DVFS and DPM such that the instantaneous power consumption is approximately equal to the instantaneous harvested power. Power neutrality means that large energy buffers are no longer required, while performance scaling ensures that available power is effectively utilised. The approach is experimentally validated using the Samsung Exynos5422 big.LITTLE SoC directly coupled to a monocrystalline photovoltaic array, with only 47mF of intermediate energy storage. Results show that the proposed approach is successful in tracking harvested power, stabilising the supply voltage to within 5% of the target value for over 93% of the test duration, resulting in the execution of 69% more instructions compared to existing static approaches.
Download Paper (PDF; Only available from the DATE venue WiFi)

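
Power neutral operation, as described in the abstract above, means matching instantaneous consumption to instantaneous harvested power. A minimal sketch of such a selection step is given below; the operating-point table, its numbers and the function name are hypothetical and are not measurements of the Exynos5422 or the authors' controller.

```python
# Hypothetical operating points: (frequency in MHz, active cores, power in W).
OPERATING_POINTS = [
    (200, 1, 0.20),
    (600, 2, 0.65),
    (1000, 4, 1.80),
    (1400, 4, 3.10),
]

def select_operating_point(harvested_w, points=OPERATING_POINTS):
    """Pick the most power-hungry point that still fits the harvested power.

    Returns None when even the smallest point exceeds the budget, which
    would correspond to suspending execution until more power is available.
    """
    feasible = [p for p in points if p[2] <= harvested_w]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p[2])

for harvested in (0.1, 0.7, 2.0, 5.0):
    print(harvested, select_operating_point(harvested))
```

A real controller would re-run this decision continuously as the harvested power changes, using frequency/voltage scaling and core on/off switching (DVFS and DPM in the abstract) to move between points.
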
15:00 11.5.3 EFFICIENT DECENTRALIZED ACTIVE BALANCING STRATEGY FOR SMART BATTERY CELLS
Speaker: Nitin Shivaraman, Nanyang Technological University, SG
Authors: Nitin Shivaraman, Arvind Easwaran and Sebastian Steinhorst
1Nanyang Technological University, SG; 2Technical University of Munich, DE
Abstract
Among series-connected cells in large battery packs, such as those found in electric vehicles, a charge imbalance develops over time due to manufacturing and temperature variations. Therefore, active balancing strategies can be employed in Battery Management Systems (BMSs) to attain a charge balance among cells by transferring charge between them, maximizing the usable capacity of the battery pack. Recently, decentralized BMS architectures with smart battery cells have been developed, in which balancing strategies can operate by local cooperation between the cells without requiring global coordination. In this paper, we propose a decentralized active balancing strategy for smart cells where we identify boundary cells having special properties. These boundary cells make it possible to divide the global balancing problem into independent subproblems, where local decisions on charge transfers eventually converge to a globally balanced battery pack. The proposed strategy is implemented in a simulator framework and compared with two decentralized state-of-the-art strategies. Our results show significantly improved performance and scalability of the proposed strategy in terms of charge transfer losses and communication overhead between cells, while maintaining a comparable time to balance.
Download Paper (PDF; Only available from the DATE venue WiFi)

15:15 11.5.4 WULORA: AN ENERGY EFFICIENT IOT END-NODE FOR ENERGY HARVESTING AND HETEROGENEOUS COMMUNICATION
Speaker: Michele Magno, ETH Zurich, CH
Authors: Michele Magno, Faycal Ait Aoudia, Matthieu Gautier, Olivier Berder and Luca Benini
1ETH Zurich, CH; 2Irisa - University of Rennes, FR; 3University of Rennes 1, IRISA, INRIA, FR; 4Irisa -University of Rennes, FR; 5Università di Bologna, IT
Abstract
Intelligent connected objects, which build the IoT, are electronic devices usually supplied by batteries that significantly limit their lifetime. These devices are expected to be deployed in very large numbers, and manual replacement of their batteries will severely restrict their large-scale or wide-area deployment. Therefore, energy efficiency is of the utmost importance in the design of these devices. The wireless communication between the distributed sensor devices and the host stations can consume significant energy, even more so when data needs to travel several kilometers. In this paper, we present an energy-efficient multi-sensing platform that exploits energy harvesting, long-range communication and an ultra-low-power short-range wake-up radio to achieve self-sustainability in a kilometer-range network. The proposed platform is designed with power efficiency in mind and exploits the always-on wake-up radio as both a receiver and a power management unit to significantly reduce the quiescent current even while continuously listening to the wireless channel. Moreover, the platform allows the building of a heterogeneous long-short range network architecture to reduce the latency and to reduce the power consumption in the listening phase to only 4.6 µW. Experimental results and simulations demonstrate the benefits of the proposed platform and heterogeneous network.
Download Paper (PDF; Only available from the DATE venue WiFi)

15:30 IPS-12, 84 FORMAL TIMING ANALYSIS OF NON-SCHEDULED TRAFFIC IN AUTOMOTIVE SCHEDULED TSN NETWORKS
Speaker: Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE
Authors: Fedor Smirnov, Michael Glaß, Felix Reimann and Jürgen Teich
1Friedrich-Alexander-Universität Erlangen-Nürnberg, DE; 2Tübingen University, DE; 3Audi Electronics Venture GmbH, DE
Abstract
To cope with requirements for low latency, the upcoming Ethernet standard Time-Sensitive Networking (TSN) provides enhancements for scheduled traffic, enabling mixed-criticality networks where critical messages are sent according to a system-wide schedule. While these networks provide a completely predictable behavior of the scheduled traffic by construction, timing analysis of the critical non-scheduled traffic with hard deadlines remains an unsolved issue. State-of-the-art analysis approaches consider the interference that unscheduled messages impose on each other, but there is currently no approach to determine the worst-case interference that can be imposed by scheduled traffic, a so-called schedule interference (SI), without relying on restrictions of the shape of the schedule. Considering all possible interference scenarios during each calculation of the SI is impractical, as it results in an explosion of the computational time. As a remedy, this paper proposes a) an approach to integrate the analysis of the worst-case SI into state-of-the-art timing analysis approaches and b) preprocessing techniques that reduce the computation time of the SI-calculation by several orders of magnitude without introducing any pessimism.
Download Paper (PDF; Only available from the DATE venue WiFi)

15:31 IPS-13, 368 ULTRA LOW-POWER VISUAL ODOMETRY FOR NANO-SCALE UNMANNED AERIAL VEHICLES
Speaker: Daniele Palossi, ETH Zurich, CH
Authors: Daniele Palossi, Andrea Marongiu and Luca Benini
1ETH, Zurich, CH; 2Swiss Federal Institute of Technology in Zurich (ETH2), CH; 3Università di Bologna, IT
Abstract
One of the fundamental functionalities for autonomous navigation of Unmanned Aerial Vehicles (UAVs) is the hovering capability. State-of-the-art techniques for implementing hovering on standard-size UAVs process a camera stream to determine position and orientation (visual odometry). Similar techniques are considered unaffordable in the context of nano-scale UAVs (i.e. a few centimeters in diameter), where the ultra-constrained power envelopes of tiny rotorcraft limit the on-board computational capabilities to those of low-power microcontrollers. In this work we study how the emerging ultra-low-power parallel computing paradigm could enable the execution of complex hovering algorithmic flows on nano-scale UAVs. We provide insight into the software pipeline, the parallelization opportunities and the impact of several algorithmic enhancements. Results demonstrate that the proposed software flow and architecture can deliver unprecedented GOPS/W, achieving 117 frames per second within a power envelope of 10 mW.
Download Paper (PDF; Only available from the DATE venue WiFi)

15:32 IPS-14, 598 LONG RANGE WIRELESS SENSING POWERED BY PLANT-MICROBIAL FUEL CELL
Speaker: Maurizio Rossi, University of Trento, IT
Authors: Maurizio Rossi, Pietro Tosato, Luca Gemma, Luca Torquati, Cristian Catania, Sergio Camalò and Davide Brunelli, University of Trento, IT
Abstract
Going low power and having a low or neutral impact on the environment is key for embedded systems, as pervasive and wearable consumer electronics is growing. In this paper, we present a self-sustaining, ultra-low power device, supplied by a Plant-Microbial Fuel Cell (PMFC) and capable of smart sensing and long-range communication. The use of a PMFC as a power source is challenging but has many advantages, such as requiring only that the plant be watered. The system uses aggressive power management thanks to FRAM technology, which is exploited to retain the microcontroller status and to shut down electronics without losing context information. Experimental results show that the proposed system paves the way to energy neutral sensors powered by biosystems available almost anywhere on Earth.
Download Paper (PDF; Only available from the DATE venue WiFi)
### 11.6 Dependable microprocessors and systems

**Date:** Thursday, March 30, 2017  
**Time:** 14:00 - 15:30  
**Location / Room:** SA

**Chair:**  
Maksim Jenihhin, Tallinn University of Technology, EE

**Co-Chair:**  
Antonio Miele, Politecnico di Milano, IT

The session presents two papers that investigate, respectively, the effects of soft errors on critical registers and hardware methods to detect intrusion attacks in microprocessors. A third paper provides a solution for estimating the expected lifetime of multiprocessors.

### 11.6.1 CHARACTERIZATION OF STACK BEHAVIOR UNDER SOFT ERRORS

**Authors:**  
Junchi Ma, School of Computer Science and Engineering, Southeast University, CN

**Abstract**  
As process technology scales, electronic devices become more susceptible to soft errors induced by radiation. The stack in memory implements procedure calls, and its behavior under soft errors has not been studied yet. To analyze the effects of soft errors on the stack behavior, we conduct a series of fault injection experiments in the IA-32 instruction set architecture. The injection targets are the ESP register (used as the stack pointer) and the EBP register (used as the stack-frame base pointer). We obtain a few important observations from the fault injection experiments. Results show that injections on ESP lead to silent data corruption (SDC) or a benign outcome only if the flipped ESP points to another return address when executing the RET instruction; otherwise, most of the injections cause a crash. The injected bits of these SDC and benign cases are concentrated in particular bits (4-7), and the reason for this distribution is given. Moreover, a flipped EBP may cause a series of infinite return operations, which is defined as a return cycle. We describe the basic mechanism of the return cycle and the essential condition for its occurrence.

**Download Paper (PDF; Only available from the DATE venue WiFi)**

### 11.6.2 MULTI-ARMED BANDITS FOR EFFICIENT LIFETIME ESTIMATION IN MPSoC DESIGN

**Authors:**  
Brett Meyer, McGill University, CA

**Abstract**  
Reliability in integrated circuits is becoming a critical issue with the miniaturization of electronics. Smaller process technologies have led to higher power densities, resulting in higher temperatures and earlier device wear-out. One way to mitigate failure is by over-provisioning resources and remapping tasks from failed components to components with spare capacity, or slack. Since the slack allocation design space is large, finding the optimum is difficult, and brute-force approaches are impractical. During design space exploration, device lifetimes are typically evaluated using Monte-Carlo Simulation (MCS) by sampling each design equally; this method is inefficient since poor designs are evaluated as accurately as good designs. A better method focuses sampling time on the designs that are difficult to distinguish, reducing the time required to evaluate a set of designs; this can be accomplished using Multi-armed Bandit (MAB) algorithms. This work demonstrates that MAB algorithms achieve the same level of accuracy as MCS with 1.45 to 5.26 times fewer samples.
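
The idea of steering samples toward hard-to-distinguish designs can be sketched with a UCB1-style rule. This is an illustrative stand-in, not necessarily the bandit algorithm used in the paper: the exponential lifetime model, the design means, the reward scaling heuristic, and all function names are assumptions.

```python
import math
import random


def sample_lifetime(mean_years, rng):
    """Stand-in for one Monte Carlo lifetime sample (exponential model is an assumption)."""
    return rng.expovariate(1.0 / mean_years)


def ucb_allocate(true_means, budget, rng):
    """Spend `budget` samples across candidate designs using a UCB1-style rule."""
    n = len(true_means)
    if budget < n:
        raise ValueError("budget must cover at least one sample per design")
    pulls = [0] * n
    totals = [0.0] * n

    def draw(i):
        totals[i] += sample_lifetime(true_means[i], rng)
        pulls[i] += 1

    for i in range(n):  # initialise with one sample per design
        draw(i)
    for t in range(n, budget):
        means = [totals[i] / pulls[i] for i in range(n)]
        scale = max(means)  # heuristic reward scale, since lifetimes are unbounded
        scores = [
            means[i] + scale * math.sqrt(2.0 * math.log(t) / pulls[i])
            for i in range(n)
        ]
        draw(max(range(n), key=scores.__getitem__))
    estimates = [totals[i] / pulls[i] for i in range(n)]
    return pulls, estimates


rng = random.Random(0)
design_means = [4.0, 9.5, 10.0]  # hypothetical mean lifetimes in years
pulls, estimates = ucb_allocate(design_means, budget=600, rng=rng)
best = max(range(len(estimates)), key=estimates.__getitem__)
print("samples per design:", pulls, "estimated best design:", best)
```

Uniform Monte Carlo sampling would give each of the three designs 200 samples; the bandit rule instead lets the running estimates decide where the next sample is spent.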

**Download Paper (PDF; Only available from the DATE venue WiFi)**
HARDWARE-BASED ON-LINE INTRUSION DETECTION VIA SYSTEM CALL ROUTINE FINGERPRINTING

Speaker:
Yiorgos Makris, The University of Texas at Dallas, US

Authors:
Liwei Zhou and Yiorgos Makris, The University of Texas at Dallas, US

Abstract
We introduce a hardware-based methodology for performing on-line intrusion detection in microprocessors. The proposed method extracts fingerprints from the basic blocks of the routine executed in response to a system call and examines their validity using a Bloom filter. Implementation in hardware renders spoofing attacks, to which operating system or hypervisor-level intrusion detection methods are vulnerable, ineffective. The proposed method is evaluated using kernel rootkits which covertly modify the system call service routines of a Linux operating system running on a 32-bit x86 architecture, implemented in the Simics simulation environment, while hardware overhead is evaluated using a predictive 45nm PDK.
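
As a rough software illustration of checking basic-block fingerprints against a Bloom filter, consider the sketch below. The fingerprint function (a truncated SHA-256), the filter parameters, and the byte strings standing in for basic blocks are all assumptions; the paper's hardware fingerprinting scheme is not described in this abstract.

```python
import hashlib


class BloomFilter:
    """Bit-vector Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: bytes):
        # Derive several hash positions by prefixing a one-byte seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: bytes) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def fingerprint(basic_block: bytes) -> bytes:
    """Hypothetical fingerprint: first 4 bytes of SHA-256 over the block's bytes."""
    return hashlib.sha256(basic_block).digest()[:4]


# Hypothetical instruction bytes of a trusted system call routine.
trusted_blocks = [b"\x55\x89\xe5\x83\xec\x10", b"\x8b\x45\x08\x5d\xc3", b"\x31\xc0\xc9\xc3"]

valid = BloomFilter()
for block in trusted_blocks:
    valid.add(fingerprint(block))

observed = b"\x55\x89\xe5\x83\xec\x10"
print("fingerprint accepted:", fingerprint(observed) in valid)
```

A fingerprint that was never inserted is rejected unless it happens to collide with already-set bits, which is the false-positive case that the filter size and hash count are chosen to keep rare.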

Download Paper (PDF; Only available from the DATE venue WiFi)

EVALUATING MATRIX REPRESENTATIONS FOR ERROR-TOLERANT COMPUTING

Speaker:
Pareesa Golnari, Princeton University, US

Authors:
Pareesa Ameneh Golnari and Sharad Malik, Princeton University, US

Abstract
We propose a methodology to determine the suitability of different data representations in terms of their error-tolerance for a given application with accelerator-based computing. This methodology helps match the characteristics of a representation to the data access patterns in an application. For this, we first identify a benchmark of key kernels from linear algebra that can be used to construct applications of interest using any of several widely used data representations. This is then used in an experimental framework for studying the error tolerance of a specific data format for an application. As case studies, we evaluate the error-tolerance of seven data-formats on sparse matrix to vector multiplication, diagonal add, and two machine learning applications (i) principal component analysis (PCA), which is a statistical technique widely used in data analysis and (ii) movie recommendation system with Restricted Boltzmann Machine (RBM) as the core. We observe that the Dense format behaves well for complicated data accesses such as diagonal accessing but is poor in utilizing local memory. Sparse formats with simpler addressing methods and a careful selection of stored information, e.g., CRS and ELLPACK, demonstrate a better error-tolerance for most of our target applications.
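
To illustrate why the choice of representation matters for error tolerance, the sketch below builds a compressed row storage (CRS) matrix, multiplies it by a vector, and then repeats the multiplication after a single bit flip in one stored column index. The matrix, the vector, and the injected fault are hypothetical; the paper's benchmark and fault model are not reproduced here.

```python
def dense_to_crs(matrix):
    """Convert a dense row-major matrix to CRS arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def crs_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A x using CRS arrays."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


A = [[4, 0, 0],
     [0, 0, 2],
     [1, 0, 3]]
x = [1, 2, 3]

values, col_idx, row_ptr = dense_to_crs(A)
y_clean = crs_matvec(values, col_idx, row_ptr, x)

# Hypothetical fault: flip bit 1 of the second stored column index (2 -> 0).
faulty_idx = list(col_idx)
faulty_idx[1] ^= 0b10
y_faulty = crs_matvec(values, faulty_idx, row_ptr, x)

print(y_clean, y_faulty)
```

Here the corrupted index silently redirects one multiplication to the wrong vector element, changing a single output entry; a flip that pushed the index out of range would instead raise an error.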

Download Paper (PDF; Only available from the DATE venue WiFi)

SIMULATION-BASED DESIGN PROCEDURE FOR SUB 1 V CMOS CURRENT REFERENCE

Speaker:
Dmitry Osipov, University of Bremen, DE

Authors:
Dmitry Osipov and Steffen Paul, University of Bremen, DE

Abstract
This paper presents a new compact current reference and a simulation-based design procedure to establish the circuit parameters quickly and efficiently. To verify the proposed design procedure, two sub 1-V example circuits for two different reference current values (80 nA and 800 nA) were designed and simulated using 0.35 µm CMOS technology. The circuits are robust against supply voltage variation without the need for external bandgap. A line sensitivity of approximately 1-2%/V over the supply voltage range from sub 1 V is achieved in both cases. The simulated temperature coefficient (TC) values are 93 ppm/°C and 197 ppm/°C in the temperature range from 0°C to 120°C for the 800 nA and 80 nA references, respectively.

Download Paper (PDF; Only available from the DATE venue WiFi)

Coffee Break in Exhibition Area
On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Tuesday, March 28, 2017
- Coffee Break 10:30 - 11:30
- Coffee Break 16:00 - 17:00

Wednesday, March 29, 2017
- Coffee Break 10:00 - 11:00
- Coffee Break 16:00 - 17:00

Thursday, March 30, 2017
- Coffee Break 10:00 - 11:00
- Coffee Break 15:30 - 16:00

11.7 Formal Methods and Verification: Core Technologies and Applications

Date: Thursday 30 March 2017
Time: 14:00 - 15:30
Location / Room: 3B
Chair:
Barbara Jobstmann, EPFL / Cadence, CH
Co-Chair:
Christoph Scholl, University of Freiburg, DE

The session consists of three papers on formal verification and its applications. The first paper presents the use of grammar-based techniques for the analysis of high-end processor designs at the netlist level. The second paper considers a computer algebra-based technique to reverse engineer the irreducible polynomial used in the implementation of multipliers in finite fields. The third paper applies probabilistic model checking in a case study analyzing the dependability of optical communication networks with double-ring topologies (which have been proposed for multicast traffic in metropolitan areas).

11.8 Hot Topic Session: Biologically-inspired techniques for smart, secure and low power SoCs

Date: Thursday 30 March 2017
Time: 14:00 - 15:30
Location / Room: Exhibition Theatre

Organisers:
Andy M. Tyrrell, University of York, GB
Lukas Sekanina, Brno University of Technology, CZ

Chair:
Andy M. Tyrrell, University of York, GB

Co-Chair:
Lukas Sekanina, Brno University of Technology, CZ

While advanced well-tuned techniques are employed in current integrated circuits to increase the lifetime of cyber-physical, IoT and other systems, major concerns and important product differentiators such as power, security and variability continue to be major design factors. For many applications a sacrifice of performance or accuracy is acceptable in exchange for extremely low power consumption. However, even when this sacrifice is possible, other conflicting performance features must still be taken into account. Biologically-inspired techniques such as evolutionary algorithms and artificial neural networks have been used in the mainstream circuit design community infrequently. Recent years have witnessed a significant development and progress in these fields. The goal of this Special Session is to present latest research results from worldwide leading experts addressing state-of-the-art biologically-inspired techniques and devices that demonstrate the efficacy of such methods to designs focused on smart, low-power, and secure systems on chip.
11.8.2 TOWARDS LOW POWER APPROXIMATE DCT ARCHITECTURE FOR HEVC STANDARD

Speaker: Zdenek Vasicek, Brno University of Technology, CZ

Authors: Zdenek Vasicek, Vojtech Mrazek and Lukas Sekanina, Brno University of Technology, CZ

Abstract: Video processing performed directly on IoT nodes is one of the most performance as well as energy demanding applications for current IoT technology. In order to support real-time high-definition video processing, energy-reduction optimizations have to be introduced at all levels of the video processing chain. This paper deals with an efficient implementation of Discrete Cosine Transform (DCT) blocks employed in video compression based on the High Efficiency Video Coding (HEVC) standard. The proposed multiplierless 4-input DCT implementations contain approximate adders and subtractors that were obtained using genetic programming. In order to manage the complexity of evolutionary approximation and provide formal guarantees in terms of errors of key circuit components, the worst and average errors were determined exactly by means of Binary decision diagrams. Under conditions of our experiments, approximate 4-input DCTs show better quality/power trade-offs than relevant implementations available in the literature. For example, 25% power reduction for the same error was obtained in comparison with a recently highly optimized implementation.

Download Paper (PDF; Only available from the DATE venue WiFi)

11.8.3 SEMANTIC DRIVEN HIERARCHICAL LEARNING FOR ENERGY-EFFICIENT IMAGE CLASSIFICATION

Speaker: Priyadarshini Panda, Purdue University, US

Authors: Priyadarshini Panda and Kaushik Roy, Purdue University, US

Abstract: Machine-learning algorithms have shown outstanding image recognition performance for computer vision applications. While these algorithms are modeled to mimic brain-like cognitive abilities, they lack the remarkable energy-efficient processing capability of the brain. Recent studies in neuroscience reveal that the brain resolves the competition among multiple visual stimuli presented simultaneously with several mechanisms of visual attention that are key to the brain’s ability to perform cognition efficiently. One such mechanism, known as saliency-based selective attention, simplifies complex visual tasks into characteristic features and then selectively activates particular areas of the brain based on the feature (or semantic) information in the input. Interestingly, we note that there is a significant similarity among underlying characteristic semantics (like color or texture) of images across multiple objects in real-world applications. This presents us with an opportunity to decompose a large classification problem into simpler tasks based on semantic or feature similarity. In this paper, we propose semantic driven hierarchical learning to construct a tree-based classifier, inspired by the biological visual attention mechanism, for optimizing the energy efficiency of machine learning classifiers. We exploit the inherent feature similarity across images to identify the input variability and use a recursive optimization procedure to determine the data partitioning at each tree node, thereby learning the feature hierarchy. A set of binary classifiers is organized on top of the learnt hierarchy to minimize the overall test-time complexity. The feature-based learning allows selective activation of only those branches and nodes of the classification tree that are relevant to the input while keeping the remaining nodes idle. The proposed framework has been evaluated on the Caltech-256 dataset and achieves a 3.7x reduction in test complexity with a 1.2% accuracy improvement over a state-of-the-art one-vs-all tree-based method, and even higher improvements in test time (of 5.5x) when some loss in output accuracy (up to 2.5%) is acceptable.

Download Paper (PDF; Only available from the DATE venue WiFi)

11.8.4 MACHINE LEARNING FOR RUN-TIME ENERGY OPTIMISATION IN MANY-CORE SYSTEMS

Speaker: Rishad Shafik, Newcastle University, GB

Authors: Dwaipayan Biswas1, Vibishna Balagopal1, Rishad Shafik2, Bashir Al-Hashimi1 and Geoff Merrett1

1University of Southampton, GB; 2Newcastle University, GB

Abstract: In recent years, the focus of computing has moved away from performance-centric serial computation to energy-efficient parallel computation. This necessitates run-time optimisation techniques to address the dynamic resource requirements of different applications on many-core architectures. In this paper, we report intelligent run-time algorithms that have been experimentally validated for managing energy and application performance in many-core embedded systems. The algorithms are underpinned by a cross-layer system approach where the hardware, software and application layers work together to optimise the energy-performance trade-off. Algorithm development is motivated by the biological process of how a human brain (acting as an agent) interacts with the external environment (system), changing their respective states over time. This leads to a pay-off for the action taken, and the agent eventually learns to take the best decisions in future. In particular, our online approach uses a model-free reinforcement learning algorithm that selects the appropriate voltage-frequency scaling based on workload prediction to meet the applications’ performance requirements and achieve energy savings of up to 16% in comparison to state-of-the-art techniques, when tested on four ARM A15 cores of an ODROID-XU3 platform.
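
The agent-environment loop described above can be sketched with tabular Q-learning on a toy model. Everything specific here is an assumption for illustration: the three operating points, the two workload classes, the squared-frequency energy proxy, the penalty for a missed requirement, and the deterministic sweep over state-action pairs that replaces on-line exploration so the result is reproducible. It is not the algorithm evaluated on the ODROID-XU3.

```python
FREQS_GHZ = [0.6, 1.0, 1.4]        # hypothetical voltage-frequency operating points
DEMAND_GHZ = {0: 0.5, 1: 1.2}      # hypothetical workload classes: 0 = light, 1 = heavy
ALPHA, GAMMA = 0.5, 0.5            # learning rate and discount factor (assumed values)


def reward(state, action):
    """Pay-off for running workload class `state` at operating point `action`."""
    freq = FREQS_GHZ[action]
    if freq < DEMAND_GHZ[state]:
        return -10.0               # performance requirement missed
    return -(freq ** 2)            # crude energy proxy: lower frequency costs less


def next_state(state):
    return 1 - state               # toy workload alternates between light and heavy


def train(sweeps=50):
    """Apply the Q-learning update to every state-action pair, `sweeps` times."""
    q = {s: [0.0] * len(FREQS_GHZ) for s in DEMAND_GHZ}
    for _ in range(sweeps):
        for s in DEMAND_GHZ:
            for a in range(len(FREQS_GHZ)):
                target = reward(s, a) + GAMMA * max(q[next_state(s)])
                q[s][a] += ALPHA * (target - q[s][a])
    return q


q_table = train()
policy = {
    s: FREQS_GHZ[max(range(len(FREQS_GHZ)), key=q_table[s].__getitem__)]
    for s in DEMAND_GHZ
}
print(policy)
```

In this toy setting the learned policy picks the lowest frequency that still meets each workload class's demand; a run-time manager would additionally need workload prediction and on-line exploration, which the sketch omits.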

Download Paper (PDF; Only available from the DATE venue WiFi)

11.8.5 AN EVOLUTIONARY APPROACH TO HARDWARE ENCRYPTION AND TROJAN-HORSE MITIGATION

Speaker: Ernesto Sanchez, Politecnico di Torino, IT

Authors: Andrea Marcelli, Marco Restifo, Ernesto Sanchez and Giovanni Squillero, Politecnico di Torino, IT

Abstract: New threats, grouped under the name of hardware attacks, have become a serious concern in recent years. In a global market, untrusted parties in the supply chain may jeopardize the production of integrated circuits with intellectual-property piracy, illegal overproduction and hardware Trojan-horse (HT) injection. One way to protect against overproduction is to encrypt the design by inserting logic gates that prevent the circuit from generating the correct outputs unless the right key is used. Separately, reducing the number of poorly controllable signals is known to minimize the chances for an attacker to successfully hide the trigger for some malicious payload. Several approaches have successfully tackled these two issues independently. This paper proposes a novel technique based on a multi-objective evolutionary algorithm able to increase hardware security by explicitly targeting both the minimization of rare signals and the maximization of the efficacy of logic encryption. Experimental results demonstrate that the proposed method is effective in creating a secure encryption schema for all the circuits under test and in reducing the number of rare signals on six of the nine circuits, outperforming the current state of the art.

Download Paper (PDF; Only available from the DATE venue WiFi)
A voltage-scalable RISC processor integrating standard-cell based memories (SCMs) as an alternative to conventional SRAM macros, enabling it to operate at a 0.4 V single-supply voltage. The processor is implemented with a fully automated cell-based design flow, which leads to low design costs. By scaling the supply voltage and applying back-gate biasing techniques, the power dissipation of the SCMs is less than 20 uW, enabling the SCMs to operate from an ambient energy source alone. In this demonstration, the SCMs of the processor operate with a lemon battery as the ambient energy source.

More information ...
UB11.6 GNOCS: AN ULTRA-FAST, HIGHLY EXTENSIBLE, CYCLE-ACCURATE GPU-BASED PARALLEL NETWORK-ON-CHIP SIMULATOR

Presenter:
Amir CHARIF, TIMA, FR

Authors:
Nacer-Eddine Zergainoh and Michael Nicolaidis, TIMA, FR

Abstract
With the continuous decrease in feature sizes and the recent emergence of 3D stacking, chips comprising thousands of nodes are becoming increasingly relevant, and state-of-the-art NoC simulators are unable to simulate such a high number of nodes in reasonable times. In this demo, we showcase GNoCS, the first detailed, modular and scalable parallel NoC simulator running fully on GPU (Graphics Processing Unit). Based on a unique design specifically tailored for GPU parallelism, GNoCS is able to achieve unprecedented speed-ups with no loss of accuracy. To enable quick and easy validation of novel ideas, the programming model was designed with high extensibility in mind. Currently, GNoCS accurately models a VC-based microarchitecture. It supports 2D and 3D mesh topologies with full or partial vertical connections. A variety of routing algorithms and synthetic traffic patterns, as well as dependency-driven trace-based simulation (Netrace), are implemented and will be demonstrated.

More information ...

UB11.7 EMU: RAPID FPGA PROTOTYPING OF NETWORK SERVICES IN C#

Presenter:
Salvatore Galea, University of Cambridge, GB

Authors:
Nik Sultana1, Pietro Bressana2, David Greaves1, Robert Soulé2, Andrew W Moore1 and Noa Zilberman1
1University of Cambridge, GB; 2Università della Svizzera italiana, CH

Abstract
General-purpose CPUs and OS abstractions impose overheads that make it challenging to implement network functions and services in software. On the other hand, programmable hardware such as FPGAs suffers from low-level programming models, which make the rapid development of network services cumbersome. We demonstrate Emu, a framework that makes use of an HLS tool (Kiwi) and enables the execution of high-level descriptions of network services, written in C#, on both x86 and Xilinx FPGAs. Emu therefore opens up new opportunities for improved performance and power usage, and enables developers to write network services and functions more easily. We demonstrate C# implementations of network functions, such as Memcached and a DNS server, using Emu running on both x86 and the NetFPGA-SUME platform, and show that they are competitive with natively written hardware counterparts while providing a superior development and debug environment.

More information ...

UB11.9 HEPSYCODE: A SYSTEM-LEVEL METHODOLOGY FOR HW/SW CO-DESIGN OF HETEROGENEOUS PARALLEL DEDICATED SYSTEMS

Presenter:
Luigi Pomante, University of L’Aquila, IT

Authors:
Giacomo Valente1, Vittoriano Muttillo1, Daniele Di Pompeo1, Emilio Incerto2 and Daniele Ciambrone3
1University of L’Aquila, IT; 2Gran Sasso Science Institute, IT

Abstract
Heterogeneous parallel systems have been recently exploited for a wide range of application domains, for both dedicated (e.g. embedded) and general-purpose products. Such systems can include different processor cores, memories, dedicated ICs and a set of connections between them. They are so complex that the design methodology plays a major role in determining the success of the products. This demo therefore addresses the problem of the electronic system-level hw/sw co-design of heterogeneous parallel dedicated systems. In particular, it shows an enhanced CSP/SystemC-based design space exploration step (and related ESL-EDA prototype tools), in the context of an existing hw/sw co-design flow that, given the system specification and related F/NF requirements, is able to (semi)automatically propose to the designer:
- a custom heterogeneous parallel architecture;
- an HW/SW partitioning of the application;
- a mapping of the partitioned entities onto the proposed architecture.

More information ...

16:30 End of session

IP5 Interactive Presentations

Date: Thursday 30 March 2017
Time: 15:30 - 16:00
Location / Room: IP sessions (in front of rooms 4A and 5A)

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated to the IP paper is on display throughout the morning. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon Interactive Presentations session the award ‘Best IP of the Day’ is given.

IP5-1 | FORMAL MODEL FOR SYSTEM-LEVEL POWER MANAGEMENT DESIGN

Speaker:
Mirela Simonovic, Aggios, RS

Authors:
Mirela Simonovic1, Vojin Zivojnovic2 and Lazar Saranovac3
1University of Belgrade, RS; 2AGGIOS Inc., US; 3University of Belgrade, School of Electrical Engineering, RS

Abstract
In this paper we present a new formal model, called p-FSM, for system-level power management design. The p-FSM is a modular, compositional, hierarchical, and unified model for hardware and software components. The model encapsulates power management control mechanisms, operating states and properties of a component that affect power, energy and thermal aspects of the system. Inter-component dependencies are modeled through a component-based interface. By connecting multiple p-FSMs we gradually compose the model of the whole system which ensures correct-by-construction system-level control sequencing. The model can also be used to formally verify the functional correctness of the power management design.

Download Paper (PDF; Only available from the DATE venue WiFi)
EXTENDING MEMORY CAPACITY OF NEURAL ASSOCIATIVE MEMORY BASED ON RECURSIVE SYNAPTIC BIT REUSE
Speaker:
Tianchan Guan, Columbia University, US
Authors:
Tianchan Guan1, Xiaoyang Zeng1 and Mingoo Seok2
1Fudan University, CN; 2Columbia University, US
Abstract
Neural associative memory (AM) is one of the critical building blocks for cognitive workloads such as classification and recognition. It learns and retrieves memories as the human brain does, i.e., by changing the strengths of plastic synapses (weights) based on inputs and by retrieving information from the information itself. One of the key challenges in designing AM is to extend memory capacity (i.e., the number of memories that a neural AM can learn) while minimizing power and hardware overhead. However, prior art shows that memory capacity scales slowly, often logarithmically or as the square root of the total bits of synaptic weights. This makes it prohibitive in hardware and power to achieve large memory capacity for practical applications. In this paper, we propose a synaptic model called recursive synaptic bit reuse, which enables near-linear scaling of memory capacity with total synaptic bits. Our model can also handle correlated input data more robustly than the conventional model. We evaluate the proposed model in Hopfield Neural Networks (HNNs) with total synaptic bits of 5kB to 327kB and find that it can increase the memory capacity by as much as 30X over conventional models. We also study hardware cost via VLSI implementation of HNNs in 65nm CMOS, confirming that the proposed model can achieve up to 10X area savings at the same capacity over the conventional synaptic model.
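
For readers unfamiliar with associative recall, the sketch below shows a conventional Hopfield network with Hebbian weights retrieving a stored pattern from a corrupted cue. It illustrates the baseline synaptic model only, not the recursive synaptic bit reuse proposed in the paper; the patterns, network size, and function names are assumptions.

```python
def train_hebbian(patterns):
    """Build a symmetric weight matrix from +1/-1 patterns (zero diagonal)."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w


def recall(w, cue, max_steps=10):
    """Synchronously update all neurons until the state stops changing."""
    state = list(cue)
    for _ in range(max_steps):
        new_state = []
        for i, row in enumerate(w):
            field = sum(wij * sj for wij, sj in zip(row, state))
            # A zero field leaves the neuron unchanged.
            new_state.append(state[i] if field == 0 else (1 if field > 0 else -1))
        if new_state == state:
            break
        state = new_state
    return state


# Two hypothetical, mutually orthogonal 8-neuron memories.
p1 = [1, 1, 1, 1, -1, -1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train_hebbian([p1, p2])

cue = list(p1)
cue[0] = -cue[0]                 # corrupt one element of the first memory
retrieved = recall(weights, cue)
print(retrieved == p1)
```

Each weight in this conventional model is an independent integer, which is why capacity grows slowly with the total number of synaptic bits.
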
Download Paper (PDF; Only available from the DATE venue WiFi)

ANOMALIES IN SCHEDULING CONTROL APPLICATIONS AND DESIGN COMPLEXITY
Speaker:
Amir Aminifar, Swiss Federal Institute of Technology in Lausanne, CH
Authors:
Amir Aminifar1 and Enrico Bini2
1Swiss Federal Institute of Technology in Lausanne (EPFL), CH; 2University of Turin, IT
Abstract
Today, many control applications in cyber-physical systems are implemented on shared platforms. Such resource sharing may lead to complex timing behaviors and, in turn, instability of control applications. This paper highlights a number of anomalies demonstrating complex timing behaviors caused as a result of resource sharing. Such anomalous scenarios, then, lead to a dramatic increase in design complexity, if not properly considered. Here, we demonstrate that these anomalies are, in fact, very improbable. Therefore, design methodologies for these systems should mainly be devised and tuned towards the majority of cases, as opposed to anomalies, but should also be able to handle such anomalous scenarios.
Download Paper (PDF; Only available from the DATE venue WiFi)

MODELING AND INTEGRATING PHYSICAL ENVIRONMENT ASSUMPTIONS IN MEDICAL CYBER-PHYSICAL SYSTEM DESIGN
Speaker:
Chunhui Guo, Illinois Institute of Technology, US
Authors:
Zhicheng Fu1, Chunhui Guo1, Shangping Ren1, Yu Jiang2 and Lui Sha3
1Illinois Institute of Technology, US; 2Tsinghua University, CN; 3University of Illinois at Urbana-Champaign, US
Abstract
Implicit physical environment assumptions made by safety critical cyber-physical systems, such as medical cyber-physical systems (M-CPS), can lead to catastrophes. Several recent U.S. Food and Drug Administration (FDA) medical device recalls are due to implicit physical environment assumptions. In this paper, we develop a mathematical assumption model and composition rules that allow M-CPS engineers to explicitly and precisely specify assumptions about the physical environment in which the designed M-CPS operates. Algorithms are developed to integrate the mathematical assumption model with system model so that the safety of the system can be not only validated by both medical and engineering professionals but also formally verified by existing formal verification tools. We use an FDA recalled medical ventilator scenario as a case study to show how the mathematical assumption model and its integration in M-CPS design may improve the safety of the system.
Download Paper (PDF; Only available from the DATE venue WiFi)

A UTILITY-DRIVEN DATA TRANSMISSION OPTIMIZATION STRATEGY IN LARGE SCALE CYBER-PHYSICAL SYSTEMS
Speaker:
Bei Yu, The Chinese University of Hong Kong, HK
Authors:
Soumi Chattopadhyay1, Ansuman Banerjee1 and Bei Yu2
1Indian Statistical Institute, IN; 2The Chinese University of Hong Kong, HK
Abstract
In this paper, we examine the problem of data dissemination and optimization in the context of a large-scale distributed cyber-physical system (CPS), and propose a novel rule-based mechanism for effective observation collection and transmission. Our work rests on the idea that not all observations on all parameters are required at all times, so selective data transmission can reduce sensor workload significantly. Experiments show the efficacy of our proposal.
Download Paper (PDF; Only available from the DATE venue WiFi)
**PROTECT NON-VOLATILE MEMORY FROM WEAR-OUT ATTACK BASED ON TIMING DIFFERENCE OF ROW BUFFER HIT/MISS**

**Speaker:** Haiyu Mao, Tsinghua University, CN

**Authors:** Haiyu Mao1, Xian Zhang2, Guangyu Sun1 and Jiwu Shu1

1Tsinghua University, CN; 2Peking University, CN

**Abstract**

Non-volatile Memories (NVMs), such as PCM and ReRAM, have been widely proposed for future main memory design because of their low standby power, high storage density, and fast access speed. However, these NVMs suffer from the write endurance problem. To prevent a malicious program from deliberately wearing out NVMs, researchers have proposed various wear-leveling methods, which remap logical addresses to physical addresses randomly and dynamically. However, we discover that side-channel leakage based on NVM row buffer hit information can reveal details of address remappings. Consequently, it can be leveraged to side-step the wear-leveling. Our simulation shows that the attack method proposed in this paper can wear out an NVM within 137 seconds, even with the protection of state-of-the-art wear-leveling schemes. To counteract this attack, we further introduce an effective countermeasure named Intra-Row Swap (IRS) to hide the wear-leveling details. The basic idea is to enable an additional intra-row block swap when a new logical address is remapped to the memory row. Experiments demonstrate that IRS can secure NVMs with negligible timing/energy overhead compared with previous works.

**Download Paper (PDF; Only available from the DATE venue WiFi)**

**EFFECTS OF CELL SHAPES ON THE ROUTABILITY OF DIGITAL MICROFLUIDIC BIOCHIPS**

**Speaker:** Oliver Keszöcze, University of Bremen, DE

**Authors:** Kevin Leonard Schneider1, Oliver Keszöcze2, Jannis Stoppel3 and Rolf Drechsler2

1University of Bremen, DE; 2University of Bremen/DFKI GmbH, DE

**Abstract**

Digital microfluidic biochips (DMFBs) are an emerging technology promising a high degree of automation in laboratory procedures by means of manipulating small discretized amounts of fluids. A crucial part in conducting experiments on biochips is the routing of discretized droplets. While doing so, droplets must not enter each other's interference region to avoid unintended mixing. This leads to cells in the proximity of a droplet being impassable for others. For different cell shapes, the effect of these temporary blockages varies as the adjacency of cells changes with their shapes. Yet, no evaluation of routability in relation to cell shapes has been conducted so far. This paper analyses and compares various tessellations for the field of cells. Routing benchmarks are mapped to these and the results are compared in order to determine if and how cell shapes affect the performance of DMFBs, showing that certain cell shapes are superior to others.

**Download Paper (PDF; Only available from the DATE venue WiFi)**

**LESS: BIG DATA SKETCHING AND ENCRYPTION ON LOW POWER PLATFORM**

**Speaker:** Amey Kulkarni, University of Maryland Baltimore County, US

**Authors:** Amey Kulkarni, Colin Shea1, Houman Homayoun2 and Tinoosh Mohsenin2

1University of Maryland, Baltimore County, US; 2University of Maryland Baltimore County, US; 3George Mason University, US

**Abstract**

Ever-growing IoT demands big data processing and cognitive computing on mobile and battery-operated devices. However, big data processing on low-power embedded cores is challenging due to their limited communication bandwidth and on-chip storage. Additionally, IoT and cloud-based computing demand low overhead security key to avoid data breaches. In this paper, we propose a Light-weight Encryption using Scalable Sketching (LESS) framework for big data sketching and encryption using One-Time Random Linear Projections (OTRLP). The OTRLP-encoded matrix makes Known Plaintext Attacks (KPA) ineffective, and attackers cannot gain significant information from a plaintext-ciphertext pair. The LESS framework can reduce data by up to 67% with a 3.81 dB signal-to-reconstruction error rate (SNR). The framework has two important kernels, “sketching” and “sketch-reconstruction”; the latter is computationally intensive and costly. We propose to accelerate the sketch reconstruction using Orthogonal Matching Pursuit (OMP) on the domain-specific many-core hardware named Power Efficient Nano Cluster (PENC), designed by the authors. Detailed performance and power analysis suggests that the PENC platform has 15x and 200x less energy consumption and 8x and 17x faster reconstruction time compared to a low-power ARM CPU and a K1 GPU, respectively. To demonstrate the efficiency of the LESS framework, we integrate it with the Hadoop MapReduce platform for an object and scene identification application. The full hardware integration consists of tiny ARM cores, which perform task scheduling and the object identification application, while PENC acts as an accelerator for sketch reconstruction. The full hardware integration results show that the LESS framework achieves a 45% reduction in data transfers with a very low execution overhead of 0.11% and a negligible energy overhead of 0.001% when tested on 2.6GB of streaming input data. The heterogeneous LESS framework requires 2x less transfer time and achieves 2.25x higher throughput per watt compared to the MapReduce platform.

**Download Paper (PDF; Only available from the DATE venue WiFi)**

**TRUNCAPP: A TRUNCATION-BASED APPROXIMATE DIVIDER FOR ENERGY EFFICIENT DSP APPLICATIONS**

**Speaker:** Shaghayegh Vahdat, University of Tehran, IR

**Authors:** Shaghayegh Vahdat1, Mehdi Kamali1, Ali Arefzadeh-Khasa1, Zainalabedin Navab1 and Massoud Pedram2

1University of Tehran, IR; 2University of Southern California, US

**Abstract**

In this paper, we present a high-speed yet energy-efficient approximate divider where the division operation is performed by multiplying the dividend by the inverse of the divisor. In this structure, the truncated value of the dividend is multiplied exactly (approximately) by the approximate inverse value of the divisor. To assess the efficacy of the proposed divider, its design parameters are extracted and compared to those of a number of prior-art dividers in a 45nm CMOS technology. Results reveal that this structure provides 66% and 52% improvements in area and energy consumption, respectively, compared to the most advanced prior-art approximate divider. In addition, the delay and energy consumption of the division operation are reduced by about 94.4% and 99.93%, respectively, compared to those of an exact SRT radix-4 structure.

**Download Paper (PDF; Only available from the DATE venue WiFi)**
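
The idea in this abstract (truncate the dividend, then multiply by an approximate inverse of the divisor) can be illustrated with a minimal sketch. This is not the authors' circuit: the names `truncate`, `approx_divide`, `keep_bits` and `frac_bits` are hypothetical, and the inverse is obtained here by an ordinary integer division of a truncated divisor, which merely stands in for whatever approximate-inverse hardware the paper uses.

```python
def truncate(value: int, keep_bits: int) -> int:
    """Keep only the keep_bits most significant bits of a non-negative integer."""
    shift = max(value.bit_length() - keep_bits, 0)
    return (value >> shift) << shift


def approx_divide(dividend: int, divisor: int,
                  keep_bits: int = 8, frac_bits: int = 16) -> int:
    """Approximate dividend // divisor as truncated dividend times approximate inverse.

    Assumption: the fixed-point reciprocal of the truncated divisor is a
    software stand-in for an approximate inverse unit.
    """
    if dividend < 0 or divisor <= 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")
    d = truncate(dividend, keep_bits)   # truncated dividend
    v = truncate(divisor, keep_bits)    # truncated divisor
    inverse = (1 << frac_bits) // v     # fixed-point approximation of 1 / v
    return (d * inverse) >> frac_bits   # multiply, then drop the fraction bits
```

With the default parameters, `approx_divide(1000, 10)` lands within 1 of the exact quotient 100; lowering `keep_bits` trades accuracy for narrower operands, which is the kind of trade-off the abstract describes.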

**TIMING-AWARE WIRE WIDTH OPTIMIZATION FOR SADP PROCESS**

**Speaker:** Youngsoo Song, KAIST, KR

**Authors:** Youngsoo Song, Sangmin Kim and Youngsoo Shin, School of Electrical Engineering, KAIST, KR

**Abstract**

With the scaling of the minimum feature size, the RC delay of interconnect is becoming relatively more critical in next-node technology. SADP is one of the popular processes used in sub-7nm technology. In the SADP process, we can increase wire width using patterns formed by the block mask, which can reduce the wire resistance of critical nets. We determine the direction and length of each wire widening so that the resulting layout is conflict-free. We cast this as a maximum weight independent set problem and solve it by formulating an ILP. For various test circuits, the wire resistance of critical nets was reduced on average by 18.5%, which led to a 9.9% reduction in clock period. Wire width optimization in the SADP process can give insight into timing optimization through the enhancement of the fabrication process.

**Download Paper (PDF; Only available from the DATE venue WiFi)**
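
The maximum weight independent set formulation mentioned in this abstract can be sketched on a toy instance. Everything below is an assumption for illustration: the candidate names, the weights (standing in for resistance reduction) and the conflict pairs are hypothetical, and exhaustive search replaces the ILP solver the authors describe, so it only suits very small graphs.

```python
from itertools import combinations


def max_weight_independent_set(weights, conflicts):
    """Return (best_set, best_weight) over all conflict-free subsets.

    weights:   dict mapping a widening candidate to its benefit
    conflicts: iterable of candidate pairs that cannot both be chosen
    """
    nodes = list(weights)
    conflict_set = {frozenset(pair) for pair in conflicts}
    best_set, best_weight = frozenset(), 0.0
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            # Skip any subset containing two conflicting candidates.
            if any(frozenset(p) in conflict_set for p in combinations(subset, 2)):
                continue
            weight = sum(weights[n] for n in subset)
            if weight > best_weight:
                best_set, best_weight = frozenset(subset), weight
    return best_set, best_weight


# Hypothetical widening candidates for three nets and their conflicts.
weights = {"n1_left": 3.0, "n1_right": 2.0, "n2_left": 4.0, "n3_right": 1.5}
conflicts = [("n1_left", "n1_right"), ("n1_left", "n2_left"), ("n2_left", "n3_right")]
chosen, total = max_weight_independent_set(weights, conflicts)
```

On this toy instance the search selects `n1_right` and `n2_left` for a total weight of 6.0. An ILP would encode the same choice with one binary variable per candidate and one constraint per conflict pair.
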
We propose a methodology to determine the suitability of different data representations in terms of their error-tolerance for a given application with accelerator-based computing. This methodology helps match the characteristics of a representation to the data access patterns in an application. For this, we first identify a benchmark of key kernels from linear algebra that can be used to construct applications of interest using any of several widely used data representations. This is then used in an experimental framework for studying the error-tolerance of a specific data format for an application. As case studies, we evaluate the error-tolerance of seven data formats on sparse matrix to vector multiplication, diagonal add, and two machine learning applications: (i) principal component analysis (PCA), a statistical technique widely used in data analysis, and (ii) a movie recommendation system with a Restricted Boltzmann Machine (RBM) as the core. We observe that the Dense format demonstrates better error-tolerance for most of our target applications.

Download Paper (PDF; Only available from the DATE venue WiFi)
This paper presents a new compact current reference and a simulation-based design procedure to establish the circuit parameters quickly and efficiently. To verify the proposed design procedure, two sub 1 V example circuits for two different reference current values (80 nA and 800 nA) were designed and simulated using 0.35 µm CMOS technology. The circuits are robust against supply voltage variation without the need for external bandgap. A line sensitivity of approximately 1-2%/V over the supply voltage range from sub 1 V is achieved in both cases. The simulated temperature coefficient (TC) values are 93 ppm/°C and 197 ppm/°C in the temperature range from 0°C to 120°C for the 800 nA and 80 nA references, respectively.

12.1 Wearable and Smart Medical Devices Day: Industry panel: Industrial challenges for tomorrow's medical devices and tools

Date: Thursday 30 March 2017
Time: 16:00 - 17:30
Location / Room: 5BC
Organisers:
José L. Ayala, Universidad Complutense de Madrid, ES
Chris Van Hoof, IMEC, BE
Chair:
Nick Van Helleputte, IMEC, BE
Co-Chair:
José L. Ayala, Universidad Complutense de Madrid, ES
This panel will analyze the industrial challenges for tomorrow's medical devices and tools. We expect the invited companies to provide their view on how technology, market and users will drive the evolution of medical devices.

Panelists:
- Adrian Ionescu, Xsensio, CH
- David Bailey, Sensimed, CH
- Kamiar Aminian, GaitUp, CH
- Carl Van Himbeeck, Cochlear, BE
17:30 End of session

12.2 Advances in Microfluidics and Neuromorphic Architectures

Date: Thursday 30 March 2017
Time: 16:00 - 17:30
Location / Room: 4BC
Chair:
Tsung-Yi Ho, National Tsing Hua University, TW
Co-Chair:
Li Jiang, Shanghai Jiao Tong University, CN
This session consists of four presentations on emerging applications in EDA such as microfluidics and neural networks. The first presentation proposes a progressive optimization procedure for the synthesis of fault-tolerant flow-based microfluidics. The second presentation presents a hybrid microfluidic platform that enables single-cell analysis on a heterogeneous pool of cells. The next presentation discusses automatic verification of networked labs-on-chip architectures. The final presentation proposes a synthesis method for parallel convolutional layers of convolutional neural networks.

16:00 12.2.1 FAST ARCHITECTURE-LEVEL SYNTHESIS OF FAULT-TOLERANT FLOW-BASED MICROFLUIDIC BIOCHIPS

**Speaker:** Tsung-Yi Ho, National Tsing Hua University, TW  
**Authors:** Wei-Lun Huang¹, Ankur Gupta², Sudip Roy², Tsung-Yi Ho¹ and Paul Pop³  
¹National Tsing Hua University, TW; ²Indian Institute of Technology Roorkee, IN; ³Technical University of Denmark, DK  

**Abstract**  
Microfluidic-based labs-on-a-chip have emerged as a popular technology for implementing different biochemical test protocols used in medical diagnostics. However, during manufacturing or operation of such chips, faults may occur that damage the chip, which in turn wastes expensive reagent fluids. To make the chip fault-tolerant, the state-of-the-art technique adopts a simulated annealing (SA) based approach to synthesize a fault-tolerant architecture. However, the SA method is time-consuming and non-deterministic, and its over-simplified model usually yields sub-optimal results. Thus, we propose a progressive optimization procedure for the synthesis of fault-tolerant flow-based microfluidic biochips. Simulation results demonstrate that our method is efficient compared to the state-of-the-art techniques and provides near-optimal and effective solutions in 88% (on average) less CPU time over three benchmark bioprotocols.  
Download Paper (PDF; Only available from the DATE venue WiFi)

### 12.2.2 COSYN: EFFICIENT SINGLE-CELL ANALYSIS USING A HYBRID MICROFLUIDIC PLATFORM

**Speaker:** Mohamed Ibrahim, Duke University, US  
**Authors:** Mohamed Ibrahim¹, Krishnendu Chakrabarty¹ and Ulf Schlichtmann²  
¹Duke University, US; ²TU München, DE  

**Abstract**  
Single-cell genomics is used to advance our understanding of diseases such as cancer. Microfluidic solutions have recently been developed to classify cell types or perform single-cell biochemical analysis on pre-isolated types of cells. However, new techniques are needed to efficiently classify cells and conduct biochemical experiments on multiple cell types concurrently. System integration and design automation are major challenges in this context. To overcome these challenges, we present a hybrid microfluidic platform that enables complete single-cell analysis on a heterogeneous pool of cells. We combine this architecture with an associated design-automation and optimization framework, referred to as Co-Synthesis (CoSyn). The proposed framework employs real-time resource allocation to coordinate the progression of concurrent cell analysis. Simulation results show that CoSyn efficiently utilizes platform resources and outperforms baseline techniques.  
Download Paper (PDF; Only available from the DATE venue WiFi)

17:00 12.2.3 VERIFICATION OF NETWORKED LABS-ON-CHIP ARCHITECTURES

**Speaker:** Andreas Grimmer, Johannes Kepler University of Linz, AT  
**Authors:** Andreas Grimmer¹, Werner Haselmayr¹, Andreas Springer¹ and Robert Wille²  
¹Johannes Kepler University, AT; ²Johannes Kepler University Linz, AT  

**Abstract**  
Labs-on-Chips (LoCs) revolutionize conventional biochemical processes and may even replace laboratories by integrating and minimizing their functionalities on a single chip. In a promising and emerging realization of LoCs, small volumes of reagents, so-called droplets, transport the biological sample and flow in closed channels of sub-millimeter diameters. This realization is called Networked Labs-on-Chips (NuLCs). The architecture of an NuLC defines different paths through which the droplets can flow. These paths are realized by splitting channels into multiple successor channels, so-called bifurcations. However, it is not guaranteed that the architecture indeed allows droplets to be routed along the desired paths and, hence, correctly executes the intended experiment. In this work, we present the first automatic solution for verifying whether an NuLC architecture allows the droplets to be routed correctly. Our evaluations demonstrate the applicability and importance of the proposed solution on a set of NuLC architectures.  
Download Paper (PDF; Only available from the DATE venue WiFi)

17:15 12.2.4 SYNTHESIS OF ACTIVATION-PARALLEL CONVOLUTION STRUCTURES FOR NEUROMORPHIC ARCHITECTURES

**Speaker:** Seban Kim, Incheon National University, KR  
**Authors:** Seban Kim and Jaeyong Chung, Incheon National University, KR  

**Abstract**  
Convolutional neural networks have demonstrated continued success in various visual recognition challenges. The convolutional layers are implemented in an activation-serial or fully parallel manner on neuromorphic computing systems. This paper presents an unrolling method that generates parallel structures for the convolutional layers depending on a required level of parallel processing. We analyze the resource requirements for the unrolling of the two-dimensional filters, and propose methods to deal with practical considerations such as stride, borders, and alignment. We apply the proposed methods to practical convolutional neural networks including AlexNet, and the generated structures are mapped onto a recent neuromorphic computing system. This demonstrates that the proposed methods can improve the performance or reduce the power consumption significantly even without area penalty.  
Download Paper (PDF; Only available from the DATE venue WiFi)

17:30 End of session

### 12.3 Security Tools

**Date:** Thursday 30 March 2017  
**Time:** 16:00 - 17:30  
**Location / Room:** 2BC  
**Chair:** Francesco Regazzoni, AlaRI/USI, CH  
**Co-Chair:** Georg Sigl, TU Munich, DE  

Security tools provide support to build secure systems. Such techniques have made great progress in past years with improvements in SAT solvers, theorem provers and available computing power. This session includes papers that perform information flow checks on hardware designs, to check for information leaks either directly through analysis of the design or indirectly through timing channels.

### 12.3.1 REGISTER TRANSFER LEVEL INFORMATION FLOW TRACKING FOR PROVABLY SECURE HARDWARE DESIGN

**Speaker:** Ryan Kastner, University of California, San Diego, US  
**Authors:** Armati Ardeshiricham¹, Wei Hu², Joshua Marxen² and Ryan Kastner³  
¹University of California San Diego, US; ²University of California, San Diego, US; ³UCSD, US  

**Abstract**  
Information Flow Tracking (IFT) provides a formal methodology for modeling and reasoning about security properties related to integrity, confidentiality, and logical side channel. Recently, IFT has been employed for secure hardware design and verification. However, existing hardware IFT techniques either require designers to rewrite their hardware specifications in a new language or do not scale to large designs due to a low level of abstraction. In this work, we propose Register Transfer Level IFT (RTLIFT), which enables verification of security properties in an early design phase, at a higher level of abstraction, and directly on RTL code. The proposed method enables a precise understanding of all logical flows through RTL design and allows various tradeoffs in IFT precision. We show that RTLIFT achieves over 5x speedup in verification performance as compared to gate level IFT while minimizing the required effort for the designer to verify security properties on RTL designs.  
Download Paper (PDF; Only available from the DATE venue WiFi)
The first paper presents a predictive approach to measure the impact of platform changes on the application performance. The second paper introduces a unifying approach for holistic, fine-grained, hierarchical and structured view of a cyber-physical system. We demonstrate the various benefits for modeling, analysis and synthesis.

Fast and accurate performance estimation is a key challenge in modern system design. Recently, machine learning-based approaches have emerged that allow predicting the performance of an application on a target platform from executions on a different host. However, existing approaches rely on expensive instrumentation that requires source code to be available. We propose a novel sampling-based, binary-level cross-platform prediction method that accurately predicts performance of a workload on a target by relying on various performance statistics sampled on a host using built-in hardware counters. In our proposed framework, samples acquired from the host and target do not satisfy straightforward one-to-one correspondence that characterizes prior instrumentation-based approaches. The resulting alignment problem is NP-hard; to solve it efficiently, we develop a stochastic dynamic coupling (SDC) algorithm which, under mild assumptions, with high probability closely approximates optimal alignment. The prediction model constructed using SDC-aligned samples achieves on average 96.5% accuracy for 45 benchmarks at speeds of over 3 GIPS. At similar accuracies, this is up to 6× faster than instrumentation-based approaches.

A LAYERED FORMAL FRAMEWORK FOR MODELING OF CYBER-PHYSICAL SYSTEMS
Speaker: George Ungureanu, KTH Royal Institute of Technology, SE
Authors: George Ungureanu and Ingo Sander, KTH Royal Institute of Technology, SE
Abstract: Designing cyber-physical systems is highly challenging due to their manifold interdependent aspects such as composition, timing, synchronization and behavior. Several formal models exist for description and analysis of these aspects, but they focus mainly on a single or only a few system properties. We propose a formal composable framework which tackles these concerns in isolation, while capturing interaction between them as a single, layered model. This yields a holistic, fine-grained, hierarchical and structured view of a cyber-physical system. We demonstrate the various benefits for modeling, analysis and synthesis through a typical example.

Download Paper (PDF; Only available from the DATE venue WiFi)

### 12.5 Power Modeling, Estimation and Verification

**Date:** Thursday 30 March 2017  
**Time:** 16:00 - 17:30  
**Location / Room:** 3C  
**Chair:** Pascal Vivet, CEA-Leti, FR  
**Co-Chair:** Hiroshi Nakamura, University of Tokyo, JP

This session covers a wide scope on power modeling and estimation in circuit design. The first paper presents a new model for modeling electromigration in power grid network, taking into account transient effects. The second paper introduces a fast and accurate thermal simulator for 3D circuits, taking into account thermal leakage dependency. The third paper proposes a new identification technique of fine grain power sources for multi-core without the knowledge of the thermal model. The last paper presents rule based checking for quick verification at implementation level of the power intent defined in UPF.

<table>
<thead>
<tr>
<th>Time</th>
<th>Label</th>
<th>Presentation Title</th>
<th>Authors</th>
</tr>
</thead>
<tbody>
<tr>
<td>17:00</td>
<td>12.4.3</td>
<td>EFFICIENT SYNCHRONIZATION METHODS FOR LET-BASED APPLICATIONS ON A MULTI-PROCESSOR SYSTEM ON CHIP</td>
<td>Gabriela Breaban, Technical University of Eindhoven, NL</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Authors: Gabriela Breaban, Sander Stuijk and Kees Goossens, Technical University of Eindhoven, NL</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Abstract</td>
<td>Distributed control applications cover a wide range of areas such as automotive, avionics, and automation. The Logical Execution Time (LET) Model of Computation (MoC) was proposed as a formal method to describe the functional and timing behavior of such applications. However, modern Multi-Processor Systems on Chip (MPSoC) do not have a shared notion of time between processors, due to their use of a Globally Asynchronous Locally Synchronous (GALS) architecture. In this paper we propose two methods (based on FIFO channels and barriers) to implement time and data synchronization on an MPSoC. While a barrier synchronizes the execution flows of tasks at predefined points in their executions, a FIFO is an asynchronous data communication method between two tasks. First, they are used to implement LET applications. Next, we show how dataflow applications and mixed LET-dataflow applications are supported too. We implemented both methods on an MPSoC prototyped on an FPGA, and show that the data synchronization outperforms the related work by 67% in terms of software overhead.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Download Paper (PDF; Only available from the DATE venue WiFi)</td>
<td></td>
</tr>
<tr>
<td>17:30</td>
<td></td>
<td>End of session</td>
<td></td>
</tr>
</tbody>
</table>

| Time | Label | Presentation Title | Authors |
|---|---|---|---|
| 16:00 | 12.5.1 | PHYSICS-BASED ELECTROMIGRATION MODELING AND ASSESSMENT FOR MULTI-SEGMENT INTERCONNECTS IN POWER GRID NETWORKS | Xiayi Wang, Beijing University of Technology, CN |
| | | Authors: Xiayi Wang¹, Hongyu Wang², Jian He³, Sheldon X.-D. Tan⁴, Yici Cai⁵ and Shengqi Yang²<br>¹Beijing Advanced Innovation Center for Future Internet Technology, Beijing Engineering Research Center for IoT Software and Systems, Beijing University of Technology, CN; ²Beijing University of Technology, CN; ³University of California, Riverside, US; ⁴TsingHua University, CN | |
| | | Abstract | Electromigration (EM) is considered to be one of the most important reliability issues for current and future ICs in 10nm technology and below. In this paper we focus on the EM stress evaluation for one-dimensional multi-segment interconnect wires in which all the segments have the same direction, which is a common routing structure for power grid networks. The proposed method, which is based on integral transform technique, could efficiently calculate the hydrostatic stress evolution for multi-segment metal wires stressed with different current densities. The new method can also naturally consider the pre-existing residual stresses coming from thermal or other stress sources. Based on this new transient EM assessment method, a full-chip assessment algorithm for power grid networks is then proposed. The new algorithm is also based on the IR-drop metrics for failure assessment of the power grid networks. However, it finds the precise location and time of EM-induced void nucleation by directly checking the time-changing hydrostatic stresses of all the wires. The resulting EM assessment method can ensure sufficient accuracy of the EM verification for large scale power grid networks without sacrificing the efficiency. The accuracy of the proposed transient analysis approach is validated against the numerical analysis. Also the resulting EM-aware full-chip power grid reliability analysis has been demonstrated and compared with existing methods. |
| | | Download Paper (PDF; Only available from the DATE venue WiFi) |
| 16:30 | 12.5.2 | A FAST LEAKAGE AWARE THERMAL SIMULATOR FOR 3D CHIPS | Hameedah Sultan, IIT Delhi, IN |
| | | Authors: Hameedah Sultan and Smruti R. Sarangi, IIT Delhi, IN |
| | | Abstract | In this paper, we propose 3DSim, an ultrafast thermal simulator for 3D chips. It simulates the effects of both dynamic and leakage power. Our technique captures the steady state as well as the transient response with high speed and good accuracy. 3DSim uses an approach based on Green’s functions, where a Green’s function is defined as the impulse response of a unit power source. Our approach incorporates the effects of the leakage-temperature feedback loop, exploits the radial symmetry in the thermal profile, and uses Hankel transforms to yield a closed-form solution for the leakage-aware Green’s function. To further speed up our technique, we use fast numerical discrete Hankel transforms, and pre-compute and store certain functions in a lookup table. Our approach fundamentally converts a 3D problem to a set of 1D problems, thus leading to a 68X speedup compared to competing simulators, with an error limited to 1.5 °C. |
| | | Download Paper (PDF; Only available from the DATE venue WiFi) |
12.6 Efficient design methodologies for high-performance analog circuits and systems.

Date: Thursday 30 March 2017
Time: 16:00 - 17:30
Location / Room: 5A

Chair: Nuno Horta, Instituto de Telecomunicacoes, PT
Co-Chair: Deuk Heo, Washington State University, US

This session presents area- and energy-efficient design methodologies for high-performance analog circuits and systems. These papers include an energy-efficient asynchronous digital design method for digitally-assisted analog circuits, a design method of high-density energy storage components, and a robust communication link design [...]

[...] for power conversion, and can be traded off for the cost of analog components. A4A flow, A2A interfaces, and Workcraft tools are used for development of [...]

[...] via multiple power domains can effectively save power by dynamically turning off idle domains. To control domains of a design, introducing low power intent complicates the physical implementation and verification process. During the physical implementation stage, optimization or manual ECO can be tedious and error-prone for power/ground signal connections. Therefore, in this paper, we focus on low power rule checking at the physical implementation stage for multiple power domain designs. Existing methods adopt an iterative approach, which identifies one error at a time, thus possibly requiring multiple iterations. In contrast, we propose a fast low power rule checking approach that detects all errors at one time. To do so, we separate all paths into inner-domain and cross-domain paths and extract the cross-domain net topology before power rule verification. Based on the global topology, we can verify the correctness of connections and detect all errors at the same time. Experimental results show the effectiveness and efficiency of our approach, achieving 3.62X speedups in detecting all errors compared with the iterative approach. Moreover, our approach can identify complicated bugs to facilitate subsequent bug fixing.

17:30 End of session

**12.7 Software optimization for emerging memory architectures and technologies**

**Date:** Thursday 30 March 2017  
**Time:** 16:00 - 17:30  
**Location / Room:** 3B

**Chair:**  
Amit Singh, University of Southampton, GB

**Co-Chair:**  
Semeen Rehman, Technische Universität Dresden, DE

The papers in this session propose optimization techniques to improve the lifetime and performance of emerging technologies like persistent memory and scalable many-cores. Architectural optimizations are also presented to improve energy and performance of applications executing on GPU-based platforms.

16:30  12.6.2 HIGH-DENSITY MOM CAPACITOR ARRAY WITH NOVEL MORTISE-TENON STRUCTURE FOR LOW-POWER SAR ADC  
Speaker:  
Pang-Yen Chou, Technical University of Munich, DE  
Authors:  
1Nai-Chen Chen, 2Pang-Yen Chou, 3Helmut Graeb and 4Mark Po-Hung Lin  
1National Chung Cheng University, TW; 2Technische Universität München, DE; 3TU Muenchen, DE  
Abstract  
The design of capacitor structures has great impact on the capacitance density, parasitic capacitance, routability, and matching quality of the capacitor network in a SAR ADC, which may affect the power, performance, and area of the whole data converter. Most recent studies focused on common-centroid placement and routing optimization of the capacitor network. Only a few of them investigated the structures of highly integrated capacitors. In this paper, a novel mortise-tenon metal-oxide-metal capacitor structure is proposed, which has the advantages of high capacitance density and small parasitic capacitance. Based on the proposed structure, an integer-linear-programming based capacitor sizing and routing parasitic matching method is further introduced. Experimental results show that the proposed structure and method can achieve the best capacitance density and matching quality of the capacitor network in a SAR ADC.  
Download Paper (PDF; Only available from the DATE venue WiFi)

17:00  12.6.3 ADAPTIVE INTERFERENCE REJECTION IN HUMAN BODY COMMUNICATION USING VARIABLE DUTY CYCLE INTEGRATING DDR RECEIVER  
Speaker:  
Shreyas Sen, Purdue University, US  
Authors:  
1Shovan Maity, 2Debayan Das and 3Shreyas Sen  
1Purdue University, US; 2ECE, Purdue University, US  
Abstract  
Connected smart wearable devices are becoming increasingly popular with the advent of cheap, miniaturized, ultra-low-power computing and communication. Human Body Communication (HBC) is emerging as an alternative to Wireless Body Area Network (WBAN) for communication among these devices, as it provides higher energy-efficiency and security. One of the biggest bottlenecks of HBC is the interference picked up due to the human body antenna effect, with the Signal-to-Interference Ratio often worse than −20 dB. An interference-robust solution involving a dual data rate (DDR) receiver is introduced, which can adapt itself to changing interference conditions and provide high interference rejection by pulse width modulation of the integration clock, thus dynamically changing its duty cycle. The theory and architecture of the receiver are developed along with the adaptation algorithm to train the receiver to find the optimum duty cycle of operation. System-level simulations show >20 dB of rejection even in the presence of variable interference frequencies.  
Download Paper (PDF; Only available from the DATE venue WiFi)

17:30  End of session

17:00  12.7.3 PEGASUS: EFFICIENT DATA TRANSFERS FOR PGAS LANGUAGES ON NON-CACHE-COHERENT MANY-CORES  
Speaker:  
Manuel Mohr, Karlsruhe Institute of Technology, DE  
Authors:  
Manuel Mohr and Carsten Tradowsky, Karlsruhe Institute of Technology, DE  
Abstract  
To improve scalability, some many-core architectures abandon global cache coherence, but still provide a shared address space. Partitioning the shared memory and communicating via messages is a safe way of programming such machines. However, accessing pointered data structures from a foreign memory partition is expensive due to the required serialization. In this paper, we propose a novel data transfer technique that avoids serialization overhead for pointered data structures by managing cache coherence in software at object granularity. We show that for PGAS programming languages, the compiler and runtime system can completely handle the necessary cache management, thus requiring no changes to application code. Moreover, we explain how cache operations working on address ranges complement our data transfer technique. We propose a novel non-blocking implementation of range-based cache operations by offloading them to an enhanced cache controller. We evaluate our approach on a non-cache-coherent many-core architecture using a distributed-kernel benchmark suite and demonstrate a reduction of communication time of up to 39.8%.  
Download Paper (PDF; Only available from the DATE venue WiFi)

17:30  End of session

12.8 Hot Topic Session: Cyberphysical Microfluidic Biochips: EDA Challenges and Opportunities to Bridge the Gap between Microfluidics and Microbiology  
Date:  Thursday 30 March 2017  
Time:  16:00 - 17:30  
Location / Room:  Exhibition Theatre  
Organisers:  
Paul Pop, Technical University of Denmark, DK  
Seetal Potluri, Technical University of Denmark, DK  
Chair:  
Jan Madsen, Technical University of Denmark, DK  
Co-Chair:  
Seetal Potluri, Technical University of Denmark, DK  
Microfluidic biochips (also called lab-on-a-chip) are replacing the conventional biochemical analyzers by integrating all the necessary functions for biochemical analysis using microfluidics. The current trend is towards cyberphysical biochip platforms that integrate novel sensors and actuators, as well as on-chip control circuits. Motivated by the similarity to microelectronics, researchers have started to propose EDA tools for the synthesis of microfluidic biochips. However, we advocate for a paradigm shift, to bridge the formidable barrier that separates engineering (or chip design) from practical biochemistry and microbiology. The special session will serve as a “call to arms” for more focused and relevant research to increase the adoption of microfluidics in translational research.

16:00  12.8.1 DIGITAL-MICROFLUIDIC BIOCHIPS FOR QUANTITATIVE ANALYSIS: BRIDGING THE GAP BETWEEN MICROFLUIDICS AND MICROBIOLOGY  
Speaker:  
Krishnendu Chakrabarty, Duke University, US  
Authors:  
Mohamed Ibrahim and Krishnendu Chakrabarty, Duke University, US  
Abstract  
Digital-microfluidics technology has shown considerable promise for advancing sample preparation and point-of-care diagnostics; therefore, it has the potential to transform microbiology and biochemistry research. Over the past decade, a number of microfluidics design-automation techniques have been developed for on-chip droplet manipulation. However, these methods overlook the myriad complexities of biomolecular protocols and they have yet to make a significant impact in biochemistry/microbiology research. A paradigm shift in biochip design automation and a "phase transition" in research are clearly needed to bridge this gap between microfluidics and microbiology. In this paper, we explain how researchers from design-automation and embedded systems can play a key role in this transition. We present a new synthesis flow that uses realistic models of biomolecular protocols and cyberphysical adaptation to address real-world microbiology applications. We also present a list of metrics that can be used for the assessment of design-automation techniques for microbiology applications.  
16:30  12.8.2  THE CASE FOR SEMI-AUTOMATED DESIGN OF MVLSI BIOCHIPS

Speaker:
Jeffrey McDaniel, University of California, Riverside, US

Authors:
Jeffrey McDaniel, William H. Grover and Philip Brisk, University of California, Riverside, US

Abstract
In recent years, significant interest has emerged in the problem of fully automating the design of microfluidic very large scale integration (mVLSI) chips, a popular class of Lab-on-a-Chip (LoC) devices that can automatically execute a wide variety of biological assays. To date, this work has been carried out with little to no input from LoC designers. We conducted interviews with approximately 100 LoC designers, biologists, and chemists from academia and industry. Uniformly, they expressed frustration with existing design solutions, primarily commercially available software such as AutoCAD and SolidWorks; however, they also expressed limited interest in, and considerable skepticism about, the potential for “push-button” end-to-end automation. In response, we have developed a semi-automated mVLSI drawing tool that is designed specifically to address the pain points elucidated by our interviewees. We have used this tool to rapidly reproduce several previously published LoC architectures and generate fabrication-ready specifications.


17:00  12.8.3  SYNTHESIS OF ON-CHIP CONTROL CIRCUITS FOR MVLSI BIOCHIPS

Speaker:
Seetal Potluri, Technical University of Denmark, DK

Authors:
Seetal Potluri, Alexander Schneider, Martin Horsley-Petersen, Paul Pop and Jan Madsen

1 Xilinx Asia Pacific, SG; 2 Technical University of Denmark, DK

Abstract
Microfluidic VLSI (mVLSI) biochips help perform biochemistry at miniaturized scales, thus enabling cost, performance and other benefits. Although biochips are expected to replace biochemical labs, including point-of-care devices, the off-chip pressure actuators and pumps are bulky, thereby limiting biochips to laboratory environments. To address this issue, researchers have proposed methods to reduce the number of off-chip pressure sources through the integration of on-chip pneumatic control logic circuits fabricated using three-layer monolithic membrane valve technology. Traditionally, mVLSI biochip physical design was performed assuming that all of the control logic is off-chip. However, the mVLSI biochip physical design problem changes significantly with the introduction of on-chip control, since along with physical synthesis we also need to perform (i) on/off-chip control partitioning, (ii) on-chip control circuit design and (iii) the integration of on-chip control into the placement and routing design tasks. In this paper, we present a design methodology for logic synthesis and physical synthesis of mVLSI biochips that use on-chip control. We show how the proposed methodology can be successfully applied to generate biochip layouts with integrated on-chip pneumatic control.


17:15  12.8.4  SCHEDULING AND OPTIMIZATION OF GENETIC LOGIC CIRCUITS ON MICROFLUIDIC BIOCHIPS

Speaker:
Tsung-Yi Ho, National Tsing Hua University, TW

Authors:
Yu-Jhih Chen, Sumit Sharma, Sudip Roy and Tsung-Yi Ho

1 National Tsing Hua University, TW; 2 Indian Institute of Technology Roorkee, IN; 3 IIT Roorkee, IN

Abstract
Synthetic biologists design genetic logic circuits using living cells. A challenge in this task is the difficulty of constructing bigger logic circuits with several living cells, due to the crosstalk effect among the biological cells. To remove the crosstalk effect, current practice is to use separate chambers on a flow-based microfluidic biochip to isolate each reaction zone. The state-of-the-art technique assumes the same reaction time for every gate in a genetic logic circuit. This assumption is pessimistic, as each gate has a different reaction rate from the others. Hence, it causes unnecessary waiting time for faster gates, which may in turn increase the total experiment completion time significantly. In this paper, we propose a genetic logic circuit synthesis technique for flow-based microfluidic biochips that considers the different reaction time of each logic gate. Simulation results show that the proposed scheme reduces the total experiment completion time. We further minimize the number of control valves and optimize the routing of flow and control layers in the chip layout, which in turn reduces the design cost.
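
The scheduling argument in this abstract can be illustrated with a minimal sketch. It is not the authors' synthesis technique: it only compares the completion time of a small gate dependency graph when every gate is charged the slowest gate's reaction time against the case where each gate uses its own reaction time. The circuit, the gate names and the reaction times are hypothetical, and chamber, valve and routing constraints are ignored.

```python
from functools import lru_cache

# Hypothetical genetic logic circuit: gate -> gates whose outputs it consumes.
PREDECESSORS = {
    "NOT1": [],
    "NOT2": [],
    "NOR1": ["NOT1", "NOT2"],
    "NOR2": ["NOR1"],
}

# Hypothetical per-gate reaction times in hours (illustrative values only).
REACTION_TIME = {"NOT1": 2.0, "NOT2": 5.0, "NOR1": 3.0, "NOR2": 4.0}


def completion_time(predecessors, reaction_time):
    """Return the earliest time at which every gate has finished reacting.

    Each gate starts as soon as all of its predecessors are done (an ASAP
    schedule), so the result is the longest path through the dependency graph.
    """
    @lru_cache(maxsize=None)
    def finish(gate):
        start = max((finish(p) for p in predecessors[gate]), default=0.0)
        return start + reaction_time[gate]

    return max(finish(g) for g in predecessors)


# Uniform model: every gate is charged the slowest gate's reaction time.
worst = max(REACTION_TIME.values())
uniform_times = {g: worst for g in REACTION_TIME}

uniform_total = completion_time(PREDECESSORS, uniform_times)
per_gate_total = completion_time(PREDECESSORS, REACTION_TIME)

print(f"uniform worst-case model: {uniform_total} h")
print(f"per-gate reaction times:  {per_gate_total} h")
```

With these made-up numbers the uniform model finishes at 15.0 h and the per-gate model at 12.0 h; the 3 h difference is the waiting time that the uniform assumption imposes on the faster gates.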


Source URL: https://past.date-conference.com/date17/booklet/proof_reading