FM01 Welcome Reception & PhD Forum, hosted by EDAA, ACM SIGDA, and IEEE CEDA


18:00 FM01 Reception & PhD Forum, hosted by EDAA, ACM SIGDA, and IEEE CEDA
18:00 FM01.1 Design and Optimization of Resilient Ultra-Low Power Circuits
Mohammad Saber Golanbari and Mehdi Tahoori, Karlsruhe Institute of Technology, DE

Energy-constrained systems have become the cornerstone of emerging energy-harvested or battery-limited platforms for Internet of Things (IoT) applications. Since the power consumption of a circuit is highly dependent on the supply voltage, aggressive voltage scaling to the near-threshold voltage region, also known as Near-Threshold Computing (NTC), is considered an efficient way of reducing energy per operation [4], [5]. NTC provides a reasonable trade-off between the energy efficiency of a circuit and its performance [6]. The energy efficiency of NTC can be an order of magnitude better than at the nominal voltage, while the performance is still acceptable for many application domains such as the IoT. However, along with the attractive energy benefits, NTC comes with a variety of design challenges. Reducing the supply voltage to the near-threshold voltage region also reduces the performance of the circuit and increases its sensitivity to different variabilities, such as process variation, voltage fluctuation and temperature variation, by more than one order of magnitude [7]. The higher sensitivity to noise and radiation also makes NTC devices more prone to runtime failures. Additionally, the reduced noise margin leads to a functional failure rate in SRAM cells and logic designs that is orders of magnitude higher, which mandates the use of more resilient designs for ultra-low voltage [6], [4]. Therefore, the benefits of NTC are not easily accessible without overcoming several design challenges such as 1) large performance drop, 2) increased sensitivity to variabilities, and 3) increased functional failure in the circuit elements [6]. The large performance reduction is the price designers pay in order to achieve better energy efficiency. The escalated sensitivity to variabilities at reduced supply voltages forces designers to add very conservative and expensive timing margins to achieve acceptable yield and reliability.
Moreover, the designs and methods which are typically used in the nominal voltage range are inefficient for NTC circuit design, because they are not designed to deal with the high performance variation and sensitivity of circuits in the near-threshold voltage region. In summary, dealing with NTC challenges requires a new design paradigm as well as design automation flow for NTC. Various methods have been proposed to address some of these challenges at different abstraction levels [6], [4]. However, the existing methods do not cover all the aspects of NTC circuit design. The objective of this work is to provide a holistic approach for NTC design by tackling the major challenges, in the form of comprehensive design and design automation flow for NTC, mostly at circuit and logic levels. This also includes in-depth analysis of the reliability issues for NTC, as well as proposing optimization techniques for better reliability, performance, and energy efficiency.
18:00 FM01.2 1024-Channel Single 5W FPGA Towards High-quality Portable 3D Ultrasound Platform
Aya Ibrahim, William Simon, Ahmet Caner Yüzügüler, Federico Angiolini, Marcel Arditi, Jean-Philippe Thiran and Giovanni De Micheli, EPFL, CH

Volumetric Ultrasound (US) imaging is an emerging technology for medical US applications. 3D US allows the imaging of entire volumes using a single scan, unlike 2D imaging, where multiple slices must be acquired precisely by a trained sonographer in order to diagnose the entire structure. As a result, 3D US imaging shortens the acquisition time and eliminates the dependency on the presence of a trained operator during the scan. However, today's 3D systems are stationary, expensive, and power hungry. This is why 3D US systems are only available in well-equipped hospitals, and not in rescue environments, battlefields, and rural areas. Our objective is to tackle smartly and efficiently the bottlenecks of the 3D US processing pipeline, with the aim of developing a portable, battery-operated, and cheap platform while supporting as many receiving channels as possible to provide high-quality volumetric reconstruction. In this work, we develop a fully digital, high-quality, single-FPGA beamformer that supports 1024 channels, the highest number of receiving channels in today's imagers, within 5W power consumption. This is a crucial step towards our final target of a complete 3D US platform. We have demonstrated our architecture on a single Kintex Ultrascale KU-040 FPGA.
Engin Afacan, Bogazici University, TR

Reliability of CMOS circuits has become a major concern due to substantially worsening process variations and aging phenomena in deep sub-micron devices. As a result, conventional analog circuit sizing tools have become incapable of promising a certain yield, whether immediately after production or after a certain period of time. Analog circuit sizing tools have therefore been replaced by better ones, in which reliability is included in the conventional optimization problem. Variation-aware analog circuit synthesis has been studied for many years, and numerous methodologies have been proposed in the literature. On the other hand, as far as we know, there has not been any tool that takes lifetime into account during the optimization. Besides, there are a number of different issues with lifetime-aware circuit optimization, where aging analysis is still quite problematic due to modeling and simulation deficiencies. Furthermore, both kinds of tools suffer from the challenging trade-off between efficiency and accuracy. Reconfigurable analog circuit design is another way of protecting analog circuits against aging. However, designing such a complicated system by hand is a highly time-consuming process. Even though reconfigurable circuit design has been studied in the literature, there has been no attempt to automate the design process to reduce the design time. This study addresses all of these problems under the general title of reliability-aware analog circuit design automation, discusses each of them in detail, and proposes novel solutions to both existing and previously unaddressed problems.
18:00 FM01.4 Design Methodologies and Tools for Vertically Integrated Circuits
Harry Kalargaris, University of Manchester, GB

The semiconductor industry has been able to double the density of silicon devices every 18 months for the past five decades. However, economic and especially technical issues are slowing down the scaling effort beyond the 22 nm node. Vertical integration technologies, such as three-dimensional (3-D) integration and interposers, support higher integration densities while reducing the total wirelength compared to traditional two-dimensional circuits. During my PhD, I have explored the opportunities and evaluated the improvements of these emerging technologies in performance and power. Interconnects on different interposer materials, such as glass and silicon, are investigated. Design guidelines for interconnects on glass and silicon interposers are offered that satisfy power, delay, area, and crosstalk constraints. Moreover, a custom EDA flow is implemented to capture the performance and power gains from the introduction of the third dimension. This flow, which utilizes commercial EDA tools, underpins the analysis of 3-D ICs from synthesis to GDSII implementation. Finally, voltage scaling in TSV-based three-dimensional circuits is investigated. The objective is to exploit the additional slack on critical paths from the introduction of the third dimension to improve power savings by reducing the supply voltage. Guidelines and a timing model based on logical effort are offered for identifying early in the design process whether 2-D circuits can benefit from voltage reduction with vertical integration. In addition to this qualitative model, a methodology for applying voltage reduction and quantifying the power savings in 3-D ICs is presented. In a case study using a low-density parity-check decoder, the traditional approach of reducing power only through the lower wire capacitance of a 3-D IC leads to a 10% power reduction compared to the 2-D design. In contrast, the proposed approach results in a 34% power reduction.
18:00 FM01.5 Evolutionary Design of Approximate Components for Deep Learning on a Chip
Vojtech Mrazek, Brno University of Technology, CZ

Recent advances in artificial intelligence methods and a huge amount of computing resources available on a single chip have led to a renewed interest in efficient implementations of complex neuromorphic systems based on artificial neural networks (NNs). Implementing complex NNs in low-power embedded systems requires careful optimization strategies at various levels including neurons, interconnects, learning algorithms, data storage and memory access. This work is focused on reducing the power consumption of computations performed in neurons, which is of the same importance as optimizing the data storage and memory access [Judd et al. 2016]. Inexact or approximate computing has been adopted in recent years as a viable approach to reduce power consumption and improve the overall efficiency of computers. In approximate computing, circuits are not implemented exactly according to the specification, but they are simplified in order to reduce power consumption or increase operation frequency. It is assumed that the errors occurring in simplified circuits are acceptable, which is typical for error-resilient application domains such as multimedia, classification and data mining. Applications based on NNs have proven to be highly error resilient [Chippa et al. 2013]. In this paper, an automated design space exploration method based on an evolutionary algorithm is introduced. It is utilized for the design of well-optimized power-efficient NNs that have a uniform structure (i.e. all nodes are identical in all layers), which is thus suitable for hardware implementation. An error resilience analysis is performed in order to determine key constraints for the design of the approximate multipliers that are employed in the resulting NN structure. The method is capable of designing approximate multipliers in such a way that the resulting multipliers not only show a given error, but also satisfy a set of other application-specific constraints.
In general, the proposed search-based circuit approximation method can be employed in many applications utilizing approximate components. Examples include approximate arithmetic circuits such as adders and multipliers, approximate sorting networks that produce partially sorted sequences, and approximate median filters.
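As an illustration of the kind of error evaluation a search-based approximation flow needs, the sketch below exhaustively compares a candidate multiplier against the exact product. The operand-truncating multiplier and the names `truncated_mul` and `error_metrics` are assumptions chosen for illustration; they are not the multipliers or the fitness function used in the thesis.

```python
def exact_mul(a: int, b: int) -> int:
    """Reference (exact) unsigned multiplier."""
    return a * b


def truncated_mul(a: int, b: int, drop_bits: int = 2) -> int:
    """Hypothetical approximate multiplier: clear the low bits of each operand."""
    mask = ~((1 << drop_bits) - 1)
    return (a & mask) * (b & mask)


def error_metrics(approx, width: int = 4) -> dict:
    """Compare `approx` with the exact product over all unsigned width-bit operand pairs."""
    n = 1 << width
    errors = [abs(approx(a, b) - exact_mul(a, b)) for a in range(n) for b in range(n)]
    return {
        "mean_abs_error": sum(errors) / len(errors),
        "worst_case_error": max(errors),
    }


def satisfies(metrics: dict, max_worst_case: int) -> bool:
    """Example constraint check: accept a candidate only if its worst-case error is bounded."""
    return metrics["worst_case_error"] <= max_worst_case
```

An evolutionary search would call a function like `error_metrics` on every candidate circuit and discard candidates that violate the error or other application-specific constraints.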
18:00 FM01.6 System-level functional and extra-functional characterization of SoCs through assertion mining
Alessandro Danese, University of Verona, IT

Virtual prototyping is today an essential technology for modelling, verification and re-design of SoCs. The core of virtual prototyping is represented by the virtual system prototype, i.e., an electronic system level (ESL) software simulator of the entire system, used first at the architectural level and then as an executable golden reference model throughout the design cycle. A fundamental role of the virtual prototyping is to allow system level functional and extra-functional verification of the SoC. In this context, this research activity aims at making automatic the extraction of functional and extra-functional properties that characterize the behaviors of a SoC through the analysis of its simulation traces. In particular, the proposed methodology is characterized by four main research activities. The first concerns the study of a parallelizable approach to make more efficient the extraction of invariants. The extracted invariants are arithmetic/logic relations that characterize stable conditions among the variables of the design throughout the simulation. The second activity consists in the definition of a mining technique that extracts temporal assertions by using user-defined temporal templates. The final set of invariants and temporal assertions represents the concrete set of functionalities implemented by the design under verification (DUV) that can be analyzed by the verification engineers to discover unexpected behaviors. The third activity aims to propose an innovative methodology for the automatic detection of security vulnerabilities of SoCs by performing corner cases analysis through assertion mining on symbolic traces of the firmware. Finally, the fourth activity concerns the automatic generation of power state machines (PSM) by adopting an approach that maps, through calibration process, the functional behaviors of the DUV with their corresponding power consumption.
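The invariant-extraction step can be pictured with a minimal sketch that checks a few relation templates against every sample of a simulation trace. The trace format (a list of dictionaries mapping signal names to values), the three templates, and the function name `mine_invariants` are assumptions made for illustration, not the tool described in the abstract.

```python
from itertools import combinations


def mine_invariants(trace):
    """Return the template relations that hold in every sample of the trace.

    `trace` is a list of dicts mapping variable names to integer values.
    """
    if not trace:
        return []
    variables = sorted(trace[0])
    invariants = []

    # Template 1: a variable keeps one constant value for the whole simulation.
    for v in variables:
        values = {sample[v] for sample in trace}
        if len(values) == 1:
            invariants.append(f"{v} == {values.pop()}")

    # Template 2: a pairwise relation between two variables holds in all samples.
    for a, b in combinations(variables, 2):
        if all(s[a] == s[b] for s in trace):
            invariants.append(f"{a} == {b}")
        elif all(s[a] <= s[b] for s in trace):
            invariants.append(f"{a} <= {b}")
        elif all(s[a] >= s[b] for s in trace):
            invariants.append(f"{a} >= {b}")
    return invariants


trace = [
    {"req": 0, "ack": 0, "count": 1},
    {"req": 1, "ack": 0, "count": 2},
    {"req": 1, "ack": 1, "count": 3},
]
print(mine_invariants(trace))
```

Each variable pair is checked independently of the others, which is what makes this kind of mining easy to parallelize.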
18:00 FM01.7 Reliability Analysis and Dependable Execution Techniques for Processors in Nano-Scale Technologies
Ali Azarpeyvand, University of Zanjan, IR

Although advances in integrated circuit manufacturing lead to lower power consumption, higher performance, and higher transistor integration, as transistor sizes approach atomic dimensions, variations in process, voltage, and temperature, and especially high-energy particle strikes, cause transient errors, which have become the main challenge for the reliability of new silicon. Recently, the Architectural Vulnerability Factor (AVF), which reflects the probability of a failure when a soft error occurs, has been proposed as a major reliability analysis metric. In this dissertation we introduce a new method for online AVF estimation, which can be exploited in reliability-aware systems to trade off power consumption, cost, performance, and reliability. The proposed method uses the occupancy of important processor structures, such as the instruction queue and register file, for AVF estimation, and results show more than 90% accuracy. The main shortcoming of AVF estimation methods is that the AVF is estimated at the end of the current program phase and then used for the upcoming phase. To increase accuracy, we propose online prediction of the AVF at the beginning of each phase. In this method, the performance parameters of the processor that are correlated with its AVF are predicted first, and the AVF is then estimated by regression using the predicted parameters. Our experimental results demonstrate a 54% improvement in accuracy when using the predicted AVF, and the accuracy of AVF estimation using regression increases to 96%. Next, a new metric named Instruction Vulnerability Factor (IVF) and a novel AVF estimation method based on IVF are proposed. IVF, which reflects the probability that a given instruction generates a wrong output when a soft error occurs, is used for fast and accurate AVF estimation based on the IVF of the instructions used in the application. In addition, it can be used for online AVF estimation with lower overhead.
Results extracted with our Configurable Reliability Analysis Framework (CRAF) show the variation of IVF across different instructions as well as the accuracy of the proposed AVF estimation method. Given the growing use of embedded and application-specific instruction processors, IVF has been extended to custom instructions, and the Custom Instruction Vulnerability Factor (CIVF) has been proposed, which can be exploited for reliability-aware customization of embedded processors. Using our Custom Instruction Vulnerability Framework (CIVA), we reveal variation in the vulnerability of different custom instructions, even for custom instructions with the same operation but distinct layouts. Finally, an analytical method for CIVF calculation is proposed, based on simple probability theorems and without the need for fault injection and simulation. Our simulation results prove the correctness of this method.
18:00 FM01.8 Enabling Caches in Probabilistic Timing Analysis
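The regression step of occupancy-based AVF estimation can be sketched as a one-variable least-squares fit. Everything here is an assumption for illustration: the single predictor (structure occupancy), the linear model, and the synthetic sample values, which are constructed to lie on a line and are not measurements from the dissertation.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Estimate the response (here: AVF) for a predictor value x."""
    slope, intercept = model
    return slope * x + intercept


# Synthetic, made-up samples: occupancy of a structure vs. a reference AVF value.
occupancy = [0.2, 0.4, 0.6, 0.8]
reference_avf = [0.1 + 0.5 * o for o in occupancy]

model = fit_line(occupancy, reference_avf)
estimated = predict(model, 0.5)  # AVF estimate for an occupancy predicted for the next phase
```

In the phase-based scheme described above, the value passed to `predict` would be the occupancy predicted for the upcoming phase rather than one measured at the end of the current phase.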
Leonidas Kosmidis, Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES

Hardware and software complexity of future critical real-time systems challenges the scalability of traditional timing analysis methods. Measurement-Based Probabilistic Timing Analysis (MBPTA) has recently emerged as an industrially-viable alternative technique to deal with complex hardware/software. Yet, MBPTA requires certain properties in the system under analysis, which are not satisfied in conventional systems. In this thesis, we introduce, for the first time, hardware and software solutions to satisfy those requirements as well as to improve MBPTA applicability. We focus on one of the hardware resources with the highest impact on both average performance and Worst-Case Execution Time (WCET) in current real-time platforms, the cache. In this line, the contributions of this thesis follow three different axes: hardware solutions and software solutions to enable MBPTA, and MBPTA analysis enhancements in systems featuring caches. At hardware level, we set the foundations of MBPTA-compliant processor designs, and define efficient time-randomised cache designs for single- and multi-level hierarchies of arbitrary complexity, including unified caches which can be time-analysed for the first time. We propose three new software randomisation approaches (a dynamic and two static variants) to control, in an MBPTA-compliant manner, the cache jitter in Commercial off-the-shelf (COTS) processors in real-time systems. To that end, all variants randomly vary the location of programs' code and data in memory across runs, to achieve probabilistic timing properties similar to those achieved with customised hardware cache designs. We propose a novel method to estimate the WCET of a program using MBPTA, without requiring the end-user to identify worst-case paths and inputs, improving its applicability in industry. We also introduce Probabilistic Timing Composability, which allows Integrated Systems to reduce their WCET in the presence of time-randomised caches.
With the above contributions, this thesis pushes the limits in the use of complex real-time embedded processor designs equipped with caches and paves the way towards the industrialisation of MBPTA technology.
18:00 FM01.9 Digital Processors with Alternative Number Systems Selected by Code Generation Mechanism
Saba Amanollahi and Ghassem Jaberipur, Shahid Beheshti University, IR

In this thesis we propose to extend the instruction set of binary processors with some redundant-digit arithmetic operations, where selected binary instructions within a given code sequence are replaced with appropriate redundant-digit ones. The selection criteria are enforced so as to reduce the overall execution energy and energy-delay product (EDP). A special branch and bound algorithm is devised to transform the data flow graph (DFG) into a new one that takes advantage of the extended redundant-digit instruction set. The DFG is supplied via the intermediate code representation of GCC compilers. The required redundant-digit arithmetic operations, such as a binary-input redundant-output multiplier and divider, have been designed. The proposed divider has the lowest energy consumption compared to state-of-the-art dividers. In the case of the multiplier, we have improved the binary parallel multiplier by using a redundant-digit adder in the final stage of the multiplication. Using the proposed multiplier, we propose a multiply-add unit whose efficiency is better than that of the binary multiply-add unit. We have also devised three- and four-operand adders which accept both binary and redundant inputs while their outputs are redundant digits. The delays of the proposed adders are smaller than that of a two-operand binary adder. The redundant-digit arithmetic units are synthesized on 45 nm NanGate technology with Synopsys Design Compiler. Their design parameters are used in modeling a binary in-order processor whose instruction set is extended by the proposed redundant-digit arithmetic operators. The proposed extended processor is evaluated via several benchmarks from the Mediabench and Mibench packages, where up to 26% energy and 44% EDP savings are observed in comparison to conventional binary processors.
Also, on average, around 29.3% of the executed instructions in the studied benchmarks are selected from the proposed redundant-digit instructions.
18:00 FM01.10 NOMeS: Near-Optimal Metaheuristic Scheduling for MPSoCs
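Why a multi-operand adder with redundant output can be faster than a two-operand binary adder can be seen in carry-save form, a well-known redundant representation. The sketch below is only an illustration of that general idea; the abstract does not state which redundant digit set the thesis uses, and the function names are hypothetical.

```python
def carry_save_add(x: int, y: int, z: int):
    """Add three non-negative integers without propagating carries.

    The result is a redundant (sum, carry) pair: every bit position is
    computed independently, so the delay does not grow with the word length.
    """
    partial_sum = x ^ y ^ z                      # per-bit sum, carries ignored
    carries = ((x & y) | (x & z) | (y & z)) << 1  # per-bit carry, shifted into place
    return partial_sum, carries


def to_binary(partial_sum: int, carries: int) -> int:
    """Convert the redundant pair back to ordinary binary (needs carry propagation)."""
    return partial_sum + carries


s, c = carry_save_add(5, 6, 7)
total = to_binary(s, c)
```

Only the final conversion needs a carry-propagating addition, which is why keeping intermediate results in redundant form across a chain of operations can save delay and energy.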
Amin Majd, University of Turku, FI

The use of parallel processing in MPSoCs these days, in a vast variety of applications, is the result of many breakthroughs over the last two decades. The development of embedded MPSoCs has led to their use in many applications such as health monitoring, video and audio processing, and autonomous vehicles, to mention just a few. Scheduling a directed acyclic task graph (DAG), which represents the precedence relations of the tasks, is well known to be an NP-complete problem. Many heuristic methods for the task scheduling problem have been proposed because the precedence constraints between tasks can be non-uniform, which creates the need for a uniform solution. We assume that the MPSoC is uniform (a homogeneous multiprocessor system) and non-preemptive (each processor completes the current task before the new one starts its execution). Moreover, we explicitly consider the communication delays between processors. When two communicating tasks are mapped to the same processor, we assume that the communication delay is zero. However, when they are mapped to different processors, a communication delay is assumed and modeled. For this problem, we propose an extension of the priority-order-based coding method as the priority-based country (OBC). This coding shows the priority of each task and the processor selected to run it. To improve the best result, we utilize a new multi-population method that is inspired by both the Imperialist Competitive Algorithm (ICA) and the Genetic Algorithm (GA).
18:00 FM01.11 Probabilistically Time-Analyzable Complex Processors in Hard Real-Time systems
Mladen Slijepcevic (1), Carles Hernandez (2), Jaume Abella (3) and Francisco Cazorla (4)
(1) Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, ES; (2) Barcelona Supercomputing Center, ES; (3) Barcelona Supercomputing Center (BSC-CNS), ES; (4) Barcelona Supercomputing Center and IIIA-CSIC, ES
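The scheduling model in this abstract (homogeneous processors, non-preemptive tasks, communication delay only between different processors) can be sketched as a decoder that turns a priority order plus a processor assignment into a makespan. The data layout and the name `makespan` are assumptions for illustration; this is not the OBC coding or the ICA/GA search itself, only the kind of evaluation such a search would perform on each candidate.

```python
def makespan(exec_time, comm, priority, proc_of):
    """Evaluate one candidate schedule.

    exec_time: dict task -> execution time
    comm:      dict (pred, succ) -> communication delay across processors
    priority:  list of all tasks, highest priority first
    proc_of:   dict task -> processor index
    """
    preds = {t: [] for t in exec_time}
    for (u, v), delay in comm.items():
        preds[v].append((u, delay))

    finish = {}      # task -> finish time
    proc_free = {}   # processor -> time at which it becomes idle
    remaining = list(priority)
    while remaining:
        # Highest-priority task whose predecessors have all been scheduled.
        task = next(t for t in remaining if all(u in finish for u, _ in preds[t]))
        remaining.remove(task)
        p = proc_of[task]
        # Data from a predecessor on the same processor arrives with zero delay.
        ready = max(
            (finish[u] + (0 if proc_of[u] == p else d) for u, d in preds[task]),
            default=0,
        )
        start = max(ready, proc_free.get(p, 0))  # non-preemptive: wait for the processor
        finish[task] = start + exec_time[task]
        proc_free[p] = finish[task]
    return max(finish.values())


exec_time = {"A": 2, "B": 3, "C": 2, "D": 1}
comm = {("A", "B"): 1, ("A", "C"): 1, ("B", "D"): 1, ("C", "D"): 1}
order = ["A", "B", "C", "D"]
single = makespan(exec_time, comm, order, {"A": 0, "B": 0, "C": 0, "D": 0})
dual = makespan(exec_time, comm, order, {"A": 0, "B": 0, "C": 1, "D": 0})
```

In this toy graph, moving task C to a second processor shortens the schedule even though it introduces communication delays; a metaheuristic explores many such (priority, assignment) pairs and keeps the best.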

Industries developing Critical Real-Time Embedded Systems (CRTES), such as Aerospace, Space, Automotive and Railways, face relentless demands for increased processor performance to support advanced new functionalities. High-performance hardware and complex software can provide such functionality, but the use of aggressive technology challenges time-predictability. With traditional timing analysis techniques, the pessimism of the Worst Case Execution Time (WCET) estimate grows if not enough information about internal hardware behaviour is available, or if the hardware is complex and not amenable to WCET analysis. In contrast, time-randomized platforms have been shown to be a promising approach to the timing analysis of complex hardware designs, and to coping with reliability issues as well. Unlike traditional timing analysis techniques, Probabilistic Timing Analysis (PTA) provides a distribution of WCET estimates, so that the value at a given exceedance probability can be exceeded only with a probability upper-bounded by the chosen exceedance threshold, which can be arbitrarily low (e.g., 10^-12 per hour), thus well below the probability of hardware failures. We focus on the Measurement-Based version of PTA (MBPTA) as it is closer to industrial practice. In this thesis we propose techniques to enable efficient usage of multicores and manycores in CRTES. We focus on the investigation and development of (1) hardware mechanisms to control inter-task interferences in shared time-randomized caches, (2) manycore Network-on-Chip designs meeting the requirements of MBPTA, and (3) an approach for obtaining WCET estimates on complex processors operating in harsh environments.
18:00 FM01.12 Programming Models and Tools for Heterogeneous Many-Core Platforms
Alessandro Capotondi, Università di Bologna, IT

The negotiation between power consumption, performance, programmability, and portability drives all computing industry designs, in particular in the mobile and embedded systems domains. Two design paradigms have proven particularly promising in this context: architectural heterogeneity and many-core processors. Parallel programming models are key to effectively harnessing the computational power of heterogeneous many-core SoCs. This thesis presents a set of techniques and HW/SW extensions that enable performance improvements and that simplify programmability for heterogeneous many-core platforms. The thesis contributions cover vertically the entire software stack for many-core platforms, from hardware abstraction layers running on top of bare-metal, to programming models; from hardware extensions for efficient parallelism support to middleware that enables optimized resource management within many-core platforms. First, we present mechanisms to decrease parallelism overheads on parallel programming runtimes for many-core platforms, targeting fine-grain parallelism. Second, we present programming model support that enables the offload of computational kernels within heterogeneous many-core systems. Third, we present a novel approach to dynamically sharing and managing many-core platforms when multiple applications coded with different programming models execute concurrently. All these contributions were validated using STMicroelectronics STHORM, a real embodiment of a state-of-the-art many-core system, and Texas Instruments Keystone II. Hardware extensions and architectural explorations were evaluated using VirtualSoC, a SystemC-based cycle-accurate simulator of many-core platforms.
18:00 FM01.13 Logic Synthesis for Majority based In-Memory Computing
Saeideh Shirinzadeh, University of Bremen, DE

This paper presents a PhD thesis on synthesis and optimization for logic-in-memory computing architectures using resistive memories. We propose synthesis approaches which exploit logic data structures and optimize them with respect to the cost metrics of in memory computing. In particular, we propose two approaches based on recently introduced Majority-Inverter Graphs (MIGs), which have been shown to be efficient for synthesis of resistive in-memory computing circuits. One proposed approach aims at tackling the latency caused by the sequential nature of in-memory computing which needs a crossbar implementation capable of parallel computing, and the other presents automated synthesis of Boolean functions for standard resistive crossbar architectures. We utilize optimization techniques for both approaches to reduce the number of instructions and RRAM devices addressing the latency and area of the resulting circuits. We also consider the issue of low write endurance of RRAMs by balancing the write traffic over all memory cells to extend the lifetime of resistive crossbar architectures.
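The majority operation at the heart of Majority-Inverter Graphs can be shown in a few lines. The sketch below models single-bit values as 0/1 integers and builds AND, OR and a full adder from three-input majority and inversion only; it illustrates the logic representation, not the crossbar mapping or the optimization algorithms of the thesis, and the function names are chosen for illustration.

```python
def maj(a: int, b: int, c: int) -> int:
    """Three-input majority: 1 when at least two inputs are 1."""
    return (a & b) | (a & c) | (b & c)


def inv(a: int) -> int:
    """Single-bit inversion."""
    return 1 - a


def and_(a: int, b: int) -> int:
    """AND is majority with one input tied to constant 0."""
    return maj(a, b, 0)


def or_(a: int, b: int) -> int:
    """OR is majority with one input tied to constant 1."""
    return maj(a, b, 1)


def full_adder(a: int, b: int, cin: int):
    """Full adder using three majority nodes and two inversions."""
    cout = maj(a, b, cin)
    total = maj(inv(cout), maj(a, b, inv(cin)), cin)
    return total, cout
```

Because every node is a majority gate with optional inverted inputs, cost metrics such as the number of nodes and the depth of the graph translate fairly directly into instruction count and latency on a majority-capable memory.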
18:00 FM01.14 Analysis, Design, and Optimization of Embedded Control Systems
Amir Aminifar, Swiss Federal Institute of Technology in Lausanne (EPFL), CH

Today, many embedded and cyber-physical systems, e.g., in the automotive domain, comprise several control applications. These control applications are often implemented on shared platforms as a result of the recent shift from federated architectures to integrated architectures. Ignoring the implementation impacts during the design of such systems results in underutilized resources, poor control performance, or instability of control applications. In particular, it is well known that such resource sharing leads to complex temporal behavior that degrades the quality of control and, more importantly, may jeopardize the stability of control applications, if not properly taken into account during design. This thesis highlights the complexity involved in the design and optimization of embedded control and cyber-physical systems. We demonstrate that intuitively-better choices of parameters (e.g., higher priority for scheduling a control task) may jeopardize the stability of the control application. We highlight a number of anomalies, demonstrating complex timing behaviors caused as a result of resource sharing, and show that such anomalous scenarios dramatically increase the design complexity, if not properly considered. The complexity of such systems cannot be coped with unless novel design and optimization methodologies are adopted. Therefore, in this thesis, we propose a novel methodology for implementation of such embedded control systems. This thesis covers online and offline methodologies, considering expected (average) control performance and worst-case stability, for both periodic and self-triggered control paradigms. In summary, this thesis advances the state of the art in system-level design and optimization of embedded control systems running on shared platforms, by taking the implementation impacts into consideration during the design process. Throughout this thesis, the evaluation is done experimentally and supported by theoretical results.
18:00 FM01.15 Making the Case for Space-Division Multiplexing of Programmable Many-Core Accelerators
Marco Balboni, University of Ferrara, IT

This work leverages an advanced lightweight NoC-centric partitioning and reconfiguration technology to enable spatial partitioning of the compute/memory resources of a programmable many-core accelerator (PMCA), proving that this is the most efficient way to share its resources in the presence of multiple offload requests and multi-programmed mixed-criticality workloads. Looking forward, the final outcome is to provide a vertically-integrated hardware/software PMCA architecture capable of dynamically, continuously and judiciously adapting a program's degree of parallelism to the prevailing dynamic execution conditions. This opens up unprecedented opportunities for engineering innovative high-end embedded systems combining simplified programming with power-efficient execution. In particular, this work fosters virtualization as a means of simplifying programming, and develops a dynamic architecture as powerful hardware assistance for adaptive resource management.
18:00 FM01.16 Towards Computer-Aided Design of Quantum Logic
Philipp Niemann, Cyber-Physical Systems, DFKI GmbH, DE

Quantum computation provides a new way of computation based on so-called quantum bits (qubits). In contrast to the conventional bits used in Boolean logic, qubits can represent not only the basis states 0 and 1, but also superpositions of both. In this way, qubits can represent multiple states at the same time, which enables massive parallelism. Additionally exploiting further quantum-mechanical phenomena such as phase shifts or entanglement allows for the design of algorithms with an asymptotic speed-up for many relevant problems (e.g. database search or integer factorization). Motivated by these prospects, researchers from various domains have investigated this emerging technology. While the exploitation of quantum-mechanical phenomena was originally discussed in a purely theoretical fashion, in the past decade the consideration of physical realizations has also gained significant interest. However, in order to formalize the above-mentioned quantum-mechanical phenomena, states of qubits are modelled as vectors in high-dimensional Hilbert spaces and are manipulated by quantum operations which can be described by unitary matrices - possibly including complex numbers. This poses serious challenges to the representation, but even more to the development of proper and efficient methods for quantum circuit synthesis that would scale to quantum systems of considerable size. In my PhD thesis, I provide substantial improvements to the state of the art in quantum logic design, especially with respect to and based on the efficient representation of quantum functionality. In order to address the enormous challenges, the design of quantum circuits is not considered as a single design step. Instead, I treat the problem as a separation of concerns and aim for adequate solutions for essential sub-problems of quantum logic design.
The proposed approaches overcome several drawbacks and limitations of previously proposed approaches and provide scalable, technology-aware, and automated solutions for dedicated sub-tasks of quantum logic design. To this end, the thesis constitutes an important step towards a CAD for quantum logic.
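To make the representation concrete, the following minimal Python sketch models a single qubit as a complex vector and a quantum operation as a unitary matrix. It is an illustration only and not a method from the thesis; the helper names (`apply`, `is_unitary`, `probabilities`) are hypothetical. An n-qubit state needs 2^n amplitudes, which is why the representation becomes a challenge for larger systems.

```python
import math

# A single-qubit state is a length-2 complex vector: amplitudes of |0> and |1>.
ZERO = [1 + 0j, 0 + 0j]

# Hadamard gate: a 2x2 unitary matrix mapping |0> to an equal superposition.
S = 1 / math.sqrt(2)
HADAMARD = [[S, S], [S, -S]]


def apply(gate, state):
    """Multiply a square matrix (list of rows) by a state vector."""
    return [sum(g * a for g, a in zip(row, state)) for row in gate]


def dagger(gate):
    """Conjugate transpose of a square matrix."""
    n = len(gate)
    return [[complex(gate[c][r]).conjugate() for c in range(n)] for r in range(n)]


def matmul(a, b):
    """Product of two square matrices of equal size."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def is_unitary(gate, tol=1e-9):
    """Check that U^dagger * U equals the identity up to a tolerance."""
    prod = matmul(dagger(gate), gate)
    n = len(gate)
    return all(abs(prod[r][c] - (1 if r == c else 0)) < tol
               for r in range(n) for c in range(n))


def probabilities(state):
    """Measurement probabilities are the squared magnitudes of the amplitudes."""
    return [abs(a) ** 2 for a in state]


# Applying the Hadamard gate to |0> yields a superposition of 0 and 1.
plus = apply(HADAMARD, ZERO)
```

Measuring `plus` yields 0 or 1 with equal probability, and applying the Hadamard gate a second time returns the state to |0>.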
18:00FM01.17Towards HW Platform for Real-Time Systems
Lukas Kohutka, Slovak University of Technology in Bratislava, SK

This paper presents recent results of our research and concept proposals within the domain of real-time systems, with a focus on improving various parameters such as the performance, determinism and robustness of such systems. As a solution, we propose a hardware platform for real-time systems that consists of the whole platform architecture at system level and individual platform components designed at register-transfer level. The basic principles used within this research are: hardware acceleration, pipelining, system modularity, and dynamic reconfiguration realized by field-programmable gate arrays, constant sorting and time sensitive networking.
18:00FM01.18High-level Modelling of OIN-based Systems with the Provision of a Low Latency Controller
Felipe Magalhaes, Ecole Polytechnique de Montreal, CA

This research presents system-level models for the deployment of OIN-based systems. The models use data extracted from fabricated devices, so the system's simulation maintains precision. In addition, a low-latency controller, namely LUCC, is deployed. It is designed with a focus on reducing the impact of control on the OIN efficiency.
18:00FM01.19Highly configurable place and route for analog and mixed-signal circuits
Eric Lao, Pierre and Marie Curie University (UPMC), FR

Many analog circuits use digital circuits for calibration purposes, and when it comes to designing layouts of such mixed-signal circuits, automation tools are not as mature as the digital ones. Nevertheless, they have improved a lot, to the point that they can help at individual steps in the analog and mixed-signal design flow. This Ph.D. thesis is about creating a highly configurable analog and mixed-signal placer and router guided by designers' preferences. Our approach is semi-automatic in order to let designers keep control of the overall placement and routing while some tiresome and error-prone tasks are automated. Digital and analog circuits have a dedicated area on a system-on-chip circuit, so they can be independently designed within a specific space. Given a netlist, our placement tool computes multiple placement results, according to designers' constraints, which match the dedicated area. Designers can experiment with different placements, which are generated in seconds, and pick the placement that suits them best according to their experience. Once the placement phase is complete, our router routes the circuit following a two-step procedure: global and detailed routing phases. The global routing is performed by converting the placement result into a graph of relations between rectangular areas, each representing a device or a channel. The graph is used with Dijkstra's algorithm to establish which areas each wire is going to pass through. Based on this routing estimation, channels are then expanded in order to leave enough space for wires to be placed by the detailed routing phase, whose goal is to assign route segments of signal nets to specific routing tracks, vias and metal layers. The detailed routing is able to respect common analog constraints such as symmetric routing. We demonstrate the capability of our tool by placing and routing a fully differential transconductor composed of 32 devices.
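The global routing step described above can be sketched as a shortest-path search over a graph of areas. The Python sketch below is an assumption-laden illustration, not the authors' tool: the area names, edge costs and the function `shortest_area_path` are hypothetical, and the abstract does not say how edge costs are derived.

```python
import heapq


def shortest_area_path(graph, source, target):
    """Dijkstra over a graph whose nodes are rectangular areas.

    graph maps an area name to a list of (neighbour, cost) pairs.
    Returns (total_cost, [areas on the path]) or None if unreachable.
    """
    dist = {source: 0}
    prev = {}
    queue = [(0, source)]
    visited = set()
    while queue:
        d, area = heapq.heappop(queue)
        if area in visited:
            continue
        visited.add(area)
        if area == target:
            break
        for neighbour, cost in graph.get(area, []):
            nd = d + cost
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                prev[neighbour] = area
                heapq.heappush(queue, (nd, neighbour))
    if target not in dist:
        return None
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


def undirected(edges):
    """Build an adjacency dict from (a, b, cost) triples."""
    graph = {}
    for a, b, cost in edges:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    return graph


# Hypothetical placement: two devices (M1, M2) joined by routing channels.
areas = undirected([
    ("M1", "ch_a", 1),
    ("ch_a", "ch_b", 4),   # a long or congested channel crossing
    ("ch_a", "ch_c", 1),
    ("ch_c", "ch_b", 1),
    ("ch_b", "M2", 1),
])
route = shortest_area_path(areas, "M1", "M2")
```

Here `route` lists the channels a wire from M1 to M2 would pass through; in a flow like the one described, such a list would tell the router which channels to expand before detailed routing.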
18:00FM01.20Optimization Techniques for Parallel Programming of Embedded Many-core Computing Platforms
Giuseppe Tagliavini, Università di Bologna, IT

Many-core computing platforms are widely adopted to accelerate compute-intensive workloads, but unfortunately their adoption greatly increases programmers' effort. This aspect is particularly critical in the embedded domain, which includes a wide range of heterogeneous devices with stringent constraints. In this work we present a set of techniques and designs that have been studied to achieve two main objectives: first, improving performance and energy efficiency of parallel embedded systems; second, enforcing software engineering practices with the aim of improving programmability. As a first contribution, we discuss the use of OpenMP tasking as a general-purpose programming model, introducing a set of runtime-level techniques to support fine-grain tasks on many-core accelerators. Experimental results show that this approach can achieve a near-ideal speed-up with an average task granularity of 7500 cycles, while previous approaches required about 100000 cycles. Since data transfers and memory management are major challenges in programming a heterogeneous platform, we explored alternative programming models to better address these issues. We introduce an extension of OpenCL to support graph-based workloads, which are common in a large class of applications (e.g., embedded computer vision). Experiments show that this solution provides huge benefits in terms of speed-up compared to standard OpenCL code. To enhance programmability, we designed complementary approaches based on OpenVX, also providing specific extensions for memory-constrained devices. To further reduce the power consumption of parallel embedded accelerators beyond the current technological limits, we finally introduce an approach based on the principles of approximate computing, which implies modification to the program semantics and proper hardware support at the architectural level.
We propose a hybrid memory system for on-chip scratchpad memories and we provide constructs at programming model level for specifying which code and data are tolerant to approximation. We demonstrate that this architecture can reduce the energy consumption of a computing platform by 25%.
18:00FM01.21Analyzing and Supporting the Reliability Decision-making Process in Computing Systems with a Reliability Evaluation Framework
Maha Kooli and Giorgio Di Natale

In this thesis, the reliability of the different software components is analyzed at different levels of the system (depending on the design phase), emphasizing the role that the interaction between hardware and software plays in the overall system. Then, the reliability of the system is evaluated via a flexible, fast, and accurate evaluation framework. Finally, the reliability decision-making process in computing systems is comprehensively supported with the developed framework (methodology and tools).
18:00FM01.22Techniques for scenario prediction and switching in system scenario based designs
Yahya Hussain Yassin, Norwegian University of Science and Technology (NTNU), NO

Many high-end embedded systems today are becoming more dynamic in nature, i.e., their functionality and resource requirements change at run-time under the influence of the environment. State-of-the-art design methodologies identify typical use cases and deal with them separately, in order to reduce the increased complexity of the system. They focus only on the worst case (WC) behavior when dynamism is introduced, mainly because it is the safest approach. In many dynamic systems the WC behavior is not dominant during the application lifetime, and the system could benefit from a less stringent behavioral constraint, resulting in a more energy-efficient system. The system scenario design methodology exploits the dynamism in these systems in order to reduce the energy consumption. Instead of only focusing on the WC behavior of the system, different run-time situations (RTSs) are identified through profiling at design-time. RTSs with similar costs, e.g., execution time, energy consumption, or memory requirement, are clustered into system scenarios. The WC behavior of each system scenario, rather than of the whole system, is then taken into account when designing the system, so that the dynamic behavior of applications is exploited. This methodology efficiently tunes system resources (system knobs) according to the application's needs, e.g., by utilizing Dynamic Voltage and/or Frequency Scaling, dynamically turning on and off idle memory banks and other processing units, or by remapping tasks and data into partitions implemented as hardware accelerators. In this paper, we present techniques for scenario prediction and switching in system scenario based designs, where we obtain significant energy reduction on an Arm Cortex-M4 based microcontroller board from Atmel.
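The clustering of RTSs into system scenarios can be illustrated with a small Python sketch. It assumes a single scalar cost per RTS and a simple greedy rule (an RTS joins the current scenario if its cost is within a tolerance of the scenario's cheapest member); the abstract does not specify the clustering rule, and the RTS names, costs and the function `cluster_rts` are hypothetical.

```python
def cluster_rts(rts_costs, tolerance):
    """Group run-time situations (RTSs) with similar cost into scenarios.

    rts_costs maps an RTS name to a scalar cost (e.g. execution time).
    Each scenario records its members and its own worst-case (WC) cost.
    """
    scenarios = []
    # Visit RTSs from cheapest to most expensive.
    for name, cost in sorted(rts_costs.items(), key=lambda kv: kv[1]):
        if scenarios and cost - scenarios[-1]["min_cost"] <= tolerance:
            # Close enough to the current scenario: join it and raise its WC.
            scenarios[-1]["members"].append(name)
            scenarios[-1]["wc_cost"] = cost
        else:
            scenarios.append({"members": [name], "min_cost": cost, "wc_cost": cost})
    return scenarios


# Hypothetical profiling result: cost of each RTS in milliseconds.
profile = {"idle": 2, "audio": 3, "video_lo": 10, "video_hi": 12, "burst": 30}
scenarios = cluster_rts(profile, tolerance=3)
global_wc = max(profile.values())
```

With these made-up numbers, a design sized for the global WC would always provision for a cost of 30, whereas a scenario-based design provisions for 3, 12 or 30 depending on the active scenario.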
18:00FM01.23Improving Processor Efficiency through Thermal Modeling and Runtime Management of Hybrid Cooling Strategies
Fulya Kaplan, Boston University, US

One of the main challenges in building future high performance systems is the ability to maintain safe on-chip temperatures in the presence of high power densities. Handling such high power densities necessitates novel cooling solutions that are significantly more efficient than their existing counterparts. A number of advanced cooling methods have been proposed to address the temperature problem in processors. However, tradeoffs exist between performance, cost, and efficiency of those cooling methods. Hence, a single cooling solution satisfying optimum conditions for an arbitrary system does not exist. This thesis claims that in order to reach exascale computing, a dramatic improvement in energy efficiency is needed, and this improvement can only be achieved through a temperature-centric co-design of the cooling and computing subsystems, which requires detailed system-level thermal modeling, design-time optimization, and runtime management techniques that are aware of the underlying processor architecture and application requirements. To this end, this thesis first proposes compact modeling methods to characterize the thermal behavior of cutting-edge cooling solutions, mainly Phase Change Material (PCM)-based cooling, liquid cooling, and thermoelectric cooling (TEC), as well as hybrid designs involving a combination of these. The proposed models enable fast and accurate exploration of a large design space. Comparisons against multi-physics simulations and measurements on testbeds validate the accuracy of our models. This thesis then develops temperature-aware design-time and runtime optimization techniques to maximize energy efficiency of a given system as a whole, attacking the major sources of inefficiency. To boost performance of parallel applications, we propose using PCM-based cooling together with our Adaptive Sprinting policy.
For systems with high-density hot spots, we apply hybrid cooling techniques that localize the cooling effort over hot spots to improve efficiency. We introduce a design-time optimization method, LoCool, which jointly minimizes cooling power in hybrid-cooled systems using TECs and liquid cooling. Finally, the scope of our work also extends to emerging technologies and their potential benefits and tradeoffs. The integrated flow cell array is one of them, where fuel cells are pumped through microchannels, providing both cooling and on-chip power generation. We explore a broad range of design parameters including maximum chip temperature, leakage power, and generated power for integrating this technology in computer systems.
18:00FM01.24P2NoC: Power- and Performance-Aware Network-on-Chip Architecture for Multi-core Systems
Hemanta Kumar Mondal, IIIT Delhi, IN

Networks-on-Chip (NoCs) are fast becoming the de-facto communication infrastructures in chip multi-processors for large-scale applications. Wireless NoCs (WNoCs) offer a promising solution to reduce the long-distance communication bottlenecks of conventional NoCs by augmenting them with single-hop, long-range wireless links. Though highly performance efficient, NoCs consume significant chip power, which increases exponentially with increasing system size and technology node, even with reduced supply voltage. Analysis of network resources for several benchmarks shows that utilization, and hence power consumption, is application dependent and that the desired performance can be achieved even without operating all resources at maximum specifications. To exploit this, we propose an adaptive two-step hybrid utilization estimation method using a stochastic model with low overheads. Based on the router utilization, we propose a low power NoC architecture using power gating and voltage scaling techniques. By implementing a power gating technique for individual routers, we achieve leakage power saving, together with an energy-efficient transceiver for idle state power saving. To overcome the impact of power gating and maintain the performance, we also propose a deadlock-free Seamless Bypass Routing (SBR) strategy that bypasses a power-gated router. We also propose an Adaptive Multi-Voltage Scaling (AMS) mechanism to achieve significant network energy saving by scaling the voltage of router components. To implement this, we propose a multi-level voltage shifter that allows switching between any two voltage levels from a fixed set of supply voltages. In addition, to enhance the performance of memory accesses in CMPs, we propose an adaptive hybrid switching strategy with a dual crossbar router and an optimal placement strategy for the memory controllers. To reduce the energy overhead of dual crossbar routers, we introduce partially drowsy and power gated techniques in the proposed architecture.
Furthermore, to enhance the performance and energy-efficiency, we also propose an interference-aware wireless interface (WI) placement algorithm with a routing strategy for the WNoC architecture by incorporating directional planar log-periodic antennas (PLPAs). The proposed architecture saves up to 92.20% leakage power in base routers and 62.50% power consumption in WIs for a 256-core system using the power gating approach. The AMS scheme saves up to 56% in network packet energy consumption as compared to baseline architectures without incurring significant performance penalty and area overheads. The adaptive hybrid switching strategy with the dual crossbar scheme improves the peak throughput of the network by 30.28% and reduces the network energy by 31.21% as compared to traditional NoC architectures.
18:00FM01.25Bipolar Resistive Switching of Bi-Layered Pt/Ta2O5/TaOx/Pt RRAM: Physics-based Modelling, Circuit Design and Testing
Firas Hatem, The University of Nottingham Malaysia Campus, MY

Designing analytical and SPICE models is a critical step toward understanding the behavior of resistive random access memory (RRAM) when integrated in memory circuit design. In this PhD research, novel physics-based mathematical and SPICE models are presented that describe the bipolar resistive switching (BRS) behavior in the Ta2O5/TaOx bi-layered RRAM with a Ta2O5 insulator layer thickness (D) smaller than 5 nm. The current conduction in the developed models is based on Schottky barrier modulation and the tunneling mechanism. The resistive switching (RS) mechanism is based on the electric field variation of the un-doped region of the conductive filament (CF). Extensive simulation is carried out and the results are correlated against the experimental data, showing that the models are in good agreement with the physical characteristics of the bi-layered RRAM and match the experimental results with an average error of < 11%. The SPICE model can be included in SPICE-compatible circuit simulation and is suitable for the exploration of the Ta2O5/TaOx bi-layered RRAM device performance at circuit level.
18:00FM01.26Edge Computing in Internet of Things (IoT)
Farzad Samie, Lars Bauer and Joerg Henkel, Karlsruhe Institute of Technology, DE

With the advancements in technology, the proliferation of portable and mobile IoT devices and their increasing processing capability, we witness that the edge of the network is moving to the IoT gateways and smart devices. To avoid Big Data issues (e.g. the high latency of cloud-based IoT), the processing of the captured data starts at the IoT edge node. However, the available processing capabilities and energy resources are still limited and do not allow the data to be fully processed on-board. This calls for efficient mechanisms for offloading some portions of the computation to the gateway or servers. Due to the limited bandwidth of the IoT gateways, choosing the offloading levels of connected devices and allocating resources (e.g. bandwidth) to them is a challenging problem. In this research, we aim to address the edge computing challenges in IoT both at the edge devices and at the gateway. At the device level, we study efficient on-board processing techniques and optimized partitioning for offloading the computation. At the gateway level, we present mechanisms to manage and allocate the shared resources among the IoT edge devices that are connected to the gateway.
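The gateway-level problem of choosing an offloading level per device under a shared bandwidth budget can be sketched as a small exhaustive search. This Python sketch is illustrative only: the objective (minimizing total device energy), the device names, the numbers and the function `choose_offloading` are assumptions, and the abstract does not describe the authors' actual mechanism.

```python
from itertools import product


def choose_offloading(devices, bandwidth_budget):
    """Pick one offloading level per device within a shared bandwidth budget.

    devices maps a device name to a list of (bandwidth_needed, device_energy)
    options, one per offloading level. Returns (total_energy, {name: level})
    minimizing total device energy, or None if no combination fits.
    """
    names = list(devices)
    best = None
    # Enumerate every combination of levels (fine for a handful of devices).
    for levels in product(*(range(len(devices[n])) for n in names)):
        bandwidth = sum(devices[n][lvl][0] for n, lvl in zip(names, levels))
        if bandwidth > bandwidth_budget:
            continue
        energy = sum(devices[n][lvl][1] for n, lvl in zip(names, levels))
        if best is None or energy < best[0]:
            best = (energy, dict(zip(names, levels)))
    return best


# Hypothetical options: level 0 processes mostly on-board (little bandwidth,
# high device energy); higher levels offload more to the gateway.
devices = {
    "ecg": [(1, 9), (4, 5), (8, 2)],
    "camera": [(2, 20), (6, 12), (10, 6)],
}
plan = choose_offloading(devices, bandwidth_budget=12)
```

In this made-up case the budget cannot accommodate full offloading for both devices, so the search gives the bandwidth to the camera, where offloading saves the most energy. Exhaustive search grows exponentially with the number of devices, so it only serves to state the problem.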
18:00FM01.27A Low Power Heterogeneous SoC for CNN-Based Vision
Francesco Conti, ETH Zurich & University of Bologna, CH

Computer vision (CV) based on Convolutional Neural Networks (CNN) is a rapidly developing field thanks to the flexibility of CNNs, their strong generalization capability and classification accuracy (matching and sometimes exceeding human performance). CNN-based classifiers are typically deployed on servers or high-end embedded platforms. However, their ability to "compress" low information density data such as images into highly informative classification tags makes them extremely interesting for wearable and IoT scenarios, should it be possible to fit their computational requirements within deeply embedded devices such as visual sensor nodes. We propose a 65nm System-on-Chip based on the PULP platform that implements a hybrid HW/SW CNN accelerator while meeting this energy efficiency target. The SoC integrates a near-threshold parallel processor cluster and a hardware accelerator for convolution-accumulation operations, which constitute the basic kernel of CNNs: it achieves peak performance of 11.2 GMAC/s @ 1.2V and peak energy efficiency of 261 GMAC/s/W @ 0.65V.
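The quoted efficiency figure can be turned into an energy-per-operation figure with a unit conversion, since GMAC/s/W is the same as GMAC per joule. The short Python sketch below does only this arithmetic; the function name is hypothetical and the result is derived from the abstract's number, not reported by the authors.

```python
def energy_per_mac_pj(gmacs_per_watt):
    """Convert an efficiency in GMAC/s/W into energy per MAC in picojoules.

    GMAC/s/W equals GMAC/J, so J/MAC = 1 / (x * 1e9), and 1 J = 1e12 pJ.
    """
    return 1e12 / (gmacs_per_watt * 1e9)


# Peak energy efficiency quoted in the abstract, at 0.65 V.
pj_per_mac = energy_per_mac_pj(261)
```

At 261 GMAC/s/W this works out to roughly 3.8 pJ per multiply-accumulate operation.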
18:00FM01.28Cross-Layer Dependability for Runtime Reconfigurable Architectures
Hongyan Zhang, Lars Bauer and Joerg Henkel, Karlsruhe Institute of Technology, DE

Runtime reconfigurable architectures based on FPGAs have emerged in recent years as a promising augmentation of conventional processor architectures such as CPUs and GPUs. Their essential feature, runtime reconfiguration, enables dynamic customization of the hardware organization for changing application requirements. Compute-intensive parts of applications are accelerated on the FPGA. Manufactured in the latest technology nodes, modern FPGAs are increasingly prone to various dependability issues. Latent defects not discovered during manufacturing, soft errors in the sequential elements caused by single event upsets, and transistor degradation caused by various aging effects are the major dependability issues in safety and mission critical applications. This work proposes a comprehensive solution for fault discovery, fault tolerance and aging mitigation against both permanent faults and soft errors in runtime reconfigurable architectures by answering the following questions: How can we test the system when its hardware organization changes during runtime? If part of the FPGA becomes faulty, how can we isolate those faulty resources such that the system continues operation with minimal performance degradation? How can we further prolong the system lifetime by delaying the failure time of the FPGA? When high reliability of correct computation is required, how can we tailor the accelerator organization to defend against the soft errors caused by single event upsets, even when the environmental conditions are changing? And how can all of this be accomplished with minimal hardware and runtime overhead? By exploiting the inherent flexibility provided by runtime reconfigurable architectures, this work addresses the nano-era dependability challenges in a cross-layer fashion, from the circuit layer through the microarchitecture layer to the system software layer, and from design time to runtime.
It enables runtime reconfigurable architectures to discover a fault with small detection latency, to tolerate the discovered fault with minimal performance degradation and even to prevent the fault in the first place.
18:00FM01.29Designing the Batteryless IoT
Andres Gomez, ETH Zurich, CH

Over the past decade, wireless sensor networks have established themselves as a robust technology with a wide range of applications, including smart buildings and cities, infrastructure and environmental monitoring, precision agriculture, and the IoT. One of the driving forces that have made these applications feasible has been the push to reduce the power consumption of electronic devices. However, the broader problem of how to supply them with the energy they require in an efficient, low-cost, long-term, self-sustainable manner has not yet been adequately solved. Traditionally, system designers have used bulky, expensive energy storage devices such as batteries and supercapacitors to power a sensor node over time scales ranging from a few days to a few years. We minimize the storage element in an application-specific manner such that computational progress can be guaranteed even with minimal input power. Experimental results have shown that these devices can operate very efficiently, and with energy-proportionality, even under highly volatile harvesting scenarios.