IP4 Interactive Presentations


Date: Thursday 22 March 2018
Time: 10:00 - 10:30
Location / Room: Conference Level, Foyer

Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.

Label Presentation Title
Authors
IP4-1 EFFICIENT MAPPING OF QUANTUM CIRCUITS TO THE IBM QX ARCHITECTURES
Speaker:
Alwin Zulehner, Johannes Kepler University Linz, AT
Authors:
Alwin Zulehner, Alexandru Paler and Robert Wille, Johannes Kepler University Linz, AT
Abstract
In March 2017, IBM launched the project IBM Q with the goal of providing access to quantum computers for a broad audience. This allowed users to conduct quantum experiments on a 5-qubit and, since June 2017, also on a 16-qubit quantum computer (called IBM QX2 and IBM QX3, respectively). In order to use these, the desired quantum functionality (e.g. provided in terms of a quantum circuit) has to be properly mapped so that the underlying physical constraints are satisfied - a complex task. This calls for solutions that conduct this mapping process automatically and efficiently. In this paper, we propose such an approach, which satisfies all constraints given by the architecture and, at the same time, aims to keep the overhead in terms of additionally required quantum gates minimal. The proposed approach is generic and can easily be configured for future architectures. Experimental evaluations show that the proposed approach clearly outperforms IBM's own mapping solution with respect to runtime as well as resulting costs.
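To make the mapping problem concrete, here is a minimal sketch that routes one two-qubit gate on a restricted coupling graph by inserting SWAPs along a shortest path. It is not the authors' algorithm: the undirected line-shaped coupling map, the layout dictionary and the function names `shortest_path` and `route_cnot` are assumptions chosen for illustration, and CNOT direction constraints are ignored.

```python
from collections import deque


def shortest_path(coupling, src, dst):
    """Breadth-first search over an undirected coupling graph (assumed connected)."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in coupling[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    if dst not in prev:
        raise ValueError("physical qubits are not connected")
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def route_cnot(coupling, layout, control, target):
    """Return the gates realizing CNOT(control, target); updates layout in place.

    layout maps logical qubit names to physical qubit indices (hypothetical format).
    """
    path = shortest_path(coupling, layout[control], layout[target])
    inverse = {phys: logical for logical, phys in layout.items()}
    gates = []
    # Move the control along the path until it sits next to the target.
    for a, b in zip(path[:-2], path[1:-1]):
        gates.append(("swap", a, b))
        la, lb = inverse.get(a), inverse.get(b)
        if la is not None:
            layout[la] = b
        if lb is not None:
            layout[lb] = a
        inverse[a], inverse[b] = lb, la
    gates.append(("cx", layout[control], layout[target]))
    return gates


# Hypothetical 4-qubit line: 0 - 1 - 2 - 3
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
layout = {"q0": 0, "q1": 1, "q2": 2, "q3": 3}
print(route_cnot(line, layout, "q0", "q3"))
```

Each inserted SWAP is overhead in the sense of the abstract, so a real mapper would choose the initial layout and the SWAP positions to minimize their total number across the whole circuit rather than gate by gate.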

Download Paper (PDF; Only available from the DATE venue WiFi)
IP4-2 PARALLEL CODE GENERATION OF SYNCHRONOUS PROGRAMS FOR A MANY-CORE ARCHITECTURE
Speaker:
Amaury Graillat, Verimag - Univ. Grenoble Alpes, FR
Authors:
Amaury Graillat1, Matthieu Moy2, Pascal Raymond3 and Benoît Dupont de Dinechin4
1Verimag - Univ. Grenoble Alpes, FR; 2Univ. Grenoble Alpes, Verimag, FR; 3VERIMAG/CNRS, FR; 4Kalray, FR
Abstract
Embedded systems tend to require more and more computational power. Many-core architectures are good candidates since they offer such power and are considered more time predictable than classical multi-cores. Data-flow synchronous languages such as Lustre or Scade are widely used for critical avionic software. Programs are described by networks of computational nodes. Implementation of such programs on a many-core architecture must ensure a bounded response time and preserve the functional behavior by taking interference into account. We consider the top-level node of a Lustre application as a software architecture description where each sub-node corresponds to a potential parallel task. Given a mapping (tasks to cores), we automatically generate code suitable for the targeted many-core architecture. This minimizes memory interference and allows the use of a framework to compute the Worst-Case Response Time.

IP4-3 SOCRATES - A SEAMLESS ONLINE COMPILER AND SYSTEM RUNTIME AUTOTUNING FRAMEWORK FOR ENERGY-AWARE APPLICATIONS
Speaker:
Gianluca Palermo, Politecnico di Milano, IT
Authors:
Davide Gadioli1, Ricardo Nobre2, Pedro Pinto3, Emanuele Vitali1, Amir H. Ashouri4, Gianluca Palermo1, Cristina Silvano1 and João M. P. Cardoso5
1Politecnico di Milano, IT; 2University of Porto / INESC TEC, PT; 3Faculty of Engineering, University of Porto, PT; 4University of Toronto, Canada, CA; 5University of Porto, PT
Abstract
Configuring program parallelism and selecting optimal compiler options according to the underlying platform architecture is a difficult task if left entirely to the programmer or handled by a default one-fits-all policy generated by the compiler or runtime system. Given the dynamics of the problem, a runtime selection of the best configuration is the desirable solution. However, implementing this solution in the application requires the insertion of a lot of glue code for profiling and runtime selection. This glue code is a programming wall that makes the approach hard to adopt in practice. This paper presents a structured approach called SOCRATES, based on a Domain Specific Language (LARA) and a runtime autotuner (mARGOt), to alleviate this effort. LARA has been used to hide the glue code insertion, thus separating the pure functional application description from extra-functional requirements. mARGOt has been used for the automatic selection of the best configuration according to the runtime evolution of the application. To demonstrate the effectiveness of the proposed approach, we evaluated SOCRATES by varying the application workloads, hardware resources and energy efficiency requirements for 12 OpenMP Polybench/C benchmarks with respect to a standard one-fits-all solution.

IP4-4 NON-INTRUSIVE PROGRAM TRACING OF NON-PREEMPTIVE MULTITASKING SYSTEMS USING POWER CONSUMPTION
Speaker:
Kamal Lamichhane, University of Waterloo, CA
Authors:
Kamal Lamichhane, Carlos Moreno and Sebastian Fischmeister, University of Waterloo, CA
Abstract
System tracing, runtime monitoring and execution reconstruction are useful techniques for protecting the safety and integrity of systems. Furthermore, with time-aware or overhead-aware techniques being available, these techniques can also be used to monitor and secure production systems. As operating systems gain in popularity, even in deeply embedded systems, these techniques face the challenge of supporting multitasking. In this paper, we propose a novel non-intrusive technique, which efficiently reconstructs the execution trace of a non-preemptive multitasking system by observing power consumption characteristics. Our technique uses the control-flow graph (CFG) of the application program to identify the most likely block of code that the system is executing at any given point in time. For the purpose of the experimental evaluation, we first instrument the source code to obtain power consumption information for each basic block, which is used as the training data for our Dynamic Time Warping and k-Nearest Neighbours (k-NN) classifier. Once the system is trained, this technique is used to identify live code-block execution (LCBE). We show that the technique can reconstruct the execution flow of programs in a multitasking environment with high accuracy.
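The classifier named in the abstract combines two standard building blocks, which can be sketched as follows. This is a generic textbook Dynamic Time Warping distance with a k-NN vote, not the authors' implementation; the trace values, the labels `block_A` and `block_B`, and the function names are made up for illustration, and the CFG-based pruning described in the abstract is omitted.

```python
from collections import Counter


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def knn_classify(trace, training, k=1):
    """Label a power trace by majority vote among its k DTW-nearest training traces.

    training is a list of (label, power_trace) pairs (hypothetical format).
    """
    ranked = sorted(training, key=lambda item: dtw_distance(trace, item[1]))
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up power signatures for two basic blocks.
training = [
    ("block_A", [1, 2, 3, 2, 1]),
    ("block_B", [5, 5, 5, 5, 5]),
]
# A time-stretched version of block_A's signature.
print(knn_classify([1, 1, 2, 3, 3, 2, 1], training))
```

DTW is a natural fit here because the same basic block may take slightly different amounts of time on each execution, so its power signature can be stretched or compressed without changing shape.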

IP4-5 ENERGY-PERFORMANCE DESIGN EXPLORATION OF A LOW-POWER MICROPROGRAMMED DEEP-LEARNING ACCELERATOR
Speaker:
Andrea Calimera, Politecnico di Torino, IT
Authors:
Andrea Calimera1, Mario R. Casu2, Giulia Santoro1, Valentino Peluso1 and Massimo Alioto3
1Politecnico di Torino, IT; 2Politecnico di Torino, Department of Electronics and Telecommunications, IT; 3National University of Singapore, SG
Abstract
This paper presents the design space exploration of a novel microprogrammable accelerator in which PEs are connected with a Network-on-Chip and benefit from low-power features enabled through a practical implementation of a Dual-Vdd assignment scheme. An analytical model, fitted with post-layout data obtained with a 28 nm FDSOI design kit, returns implementations with optimal energy-performance tradeoff by taking into consideration all the key design-space variables. The obtained Pareto analysis helps us infer optimization rules aimed at improving quality of design.
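The Pareto analysis mentioned above boils down to discarding every design point that another point beats on both axes. A minimal sketch, assuming each candidate implementation is summarized by a hypothetical (energy, delay) pair where lower is better for both; the numbers are invented and are not results from the paper.

```python
def pareto_front(points):
    """Return the points not dominated in (energy, delay); lower is better for both."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Invented (energy, delay) pairs for five candidate configurations.
candidates = [(1.0, 9.0), (2.0, 5.0), (3.0, 6.0), (4.0, 2.0), (5.0, 3.0)]
print(pareto_front(candidates))  # [(1.0, 9.0), (2.0, 5.0), (4.0, 2.0)]
```

The quadratic scan is fine for a few thousand design points; larger explorations would typically sort by one objective first and sweep.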

IP4-6 GENPIM: GENERALIZED PROCESSING IN-MEMORY TO ACCELERATE DATA INTENSIVE APPLICATIONS
Speaker:
Tajana Rosing, UC San Diego, US
Authors:
Mohsen Imani, Saransh Gupta and Tajana Rosing, University of California, San Diego, US
Abstract
Big data has become a serious problem as data volumes have been skyrocketing for the past few years. Storage and CPU technologies are overwhelmed by the amount of data they have to handle. Traditional computer architectures show poor performance when processing such huge data. Processing in-memory (PIM) is a promising technique to address the data movement issue by processing data locally inside memory. However, there are two main issues with stand-alone PIM designs: (i) PIM is not always computationally faster than CMOS logic, (ii) PIM cannot process all operations in many applications. Thus, not many applications can benefit from PIM. To generalize the use of PIM, we designed GenPIM, a general processing in-memory architecture consisting of the conventional processor as well as the PIM accelerators. GenPIM supports basic PIM functionalities in specialized non-volatile memory, including bitwise operations, search operations, addition and multiplication. For each application, GenPIM identifies the part which uses PIM operations, and processes the remaining non-PIM operations and the non-data-intensive parts of the application in general-purpose cores. GenPIM also enables configurable PIM approximation by relaxing in-memory computation. We test the efficiency of the proposed design on different emerging machine learning, compression and security applications. Our experimental evaluation shows that our design can achieve 10.9x improvement in energy efficiency and 6.4x speedup as compared to processing data in conventional cores. The results can be improved by 21.0% in energy consumption and 30.6% in performance by enabling PIM approximation while ensuring less than 2% quality loss.

IP4-7 UNIVERSAL NUMBER POSIT ARITHMETIC GENERATOR ON FPGA
Speaker:
Hayden K.-H. So, The University of Hong Kong, HK
Authors:
Manish Kumar Jaiswal and Hayden So, The University of Hong Kong, HK
Abstract
The posit number format includes a run-time varying exponent component, defined by a combination of regime bits (with run-time varying length) and exponent bits (up to ES bits, the exponent size). This also leads to a run-time variation in the size and position of its mantissa field. This run-time variation poses a hardware design challenge. Being a recent development, posit lacks adequate hardware arithmetic architectures. Thus, this paper aims at developing posit arithmetic algorithms and a generic hardware generator for them. It focuses on basic posit arithmetic (floating-point to posit conversion, posit to floating-point conversion, addition/subtraction and multiplication). These are also demonstrated on an FPGA platform. The target is to develop an open-source solution for generating basic posit arithmetic architectures with parameterized choices. This contribution would enable further exploration and evaluation of the posit system.
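The run-time varying field layout described above is easiest to see in a software decoder. The sketch below follows the commonly published posit definition (sign bit, regime run, up to `es` exponent bits, then fraction, with value useed^k * 2^e * (1 + f) and useed = 2^(2^es)); it is a software illustration only, not the paper's hardware generator, and the function name and the default parameters `n=8`, `es=1` are assumptions.

```python
def decode_posit(bits, n=8, es=1):
    """Decode an n-bit posit bit pattern (given as an int) to a Python float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")  # NaR, "not a real"
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask  # negative posits are stored in two's complement
    # Bits after the sign, most significant first.
    body = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]
    # Regime: a run of identical bits whose length is only known at run time.
    first = body[0]
    run = 1
    while run < len(body) and body[run] == first:
        run += 1
    k = run - 1 if first == 1 else -run
    rest = body[run + 1:]  # skip the bit that terminates the run, if present
    # Exponent: up to es bits, zero-padded when the regime leaves fewer.
    exp_bits = rest[:es]
    exp_bits += [0] * (es - len(exp_bits))
    e = 0
    for b in exp_bits:
        e = (e << 1) | b
    # Fraction: whatever bits remain, with an implicit leading 1.
    frac = 0.0
    for i, b in enumerate(rest[es:], start=1):
        frac += b * 2.0 ** -i
    value = 2.0 ** (k * (1 << es) + e) * (1.0 + frac)
    return -value if sign else value


print(decode_posit(0b01000000))  # 1.0
print(decode_posit(0b01001000))  # 1.5
print(decode_posit(0b01100000))  # 4.0
```

Because the regime length is data dependent, the positions of the exponent and fraction fields shift from one value to the next, which is the hardware design challenge the abstract refers to.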

IP4-8 BLOCK CONVOLUTION: TOWARDS MEMORY-EFFICIENT INFERENCE OF LARGE-SCALE CNNS ON FPGA
Speaker:
Gang Li, Institute of Automation, Chinese Academy of Sciences, CN
Authors:
Gang Li, Fanrong Li, Tianli Zhao and Jian Cheng, Institute of Automation, Chinese Academy of Sciences, CN
Abstract
FPGA-based CNN accelerators have been gaining popularity in recent years due to their high energy efficiency and great flexibility. However, as networks grow in depth and width, the volume of intermediate data becomes too large to store on chip, so data must be transferred frequently between on-chip and off-chip memory, which leads to additional off-chip memory access latency and energy consumption. In this paper, we propose a block convolution approach, a memory-efficient, simple yet effective block-based convolution that completely prevents intermediate data from streaming out to off-chip memory during network inference. Experiments on the very large VGG-16 network show that an improved top-1/top-5 accuracy of 72.60%/91.10% can be achieved on the ImageNet classification task with the proposed approach. As a case study, we implement the VGG-16 network with block convolution on a Xilinx Zynq ZC706 board, achieving a frame rate of 12.19 fps at a 150 MHz working frequency, with all intermediate data staying on chip.
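One way to picture a block-based convolution is to split the feature map into tiles and convolve each tile on its own, so that no tile needs data from its neighbours. The sketch below does this for a single 2D channel and assumes each tile is zero-padded independently; the abstract does not spell out the padding scheme, so that choice, the tile size and the function names are all assumptions.

```python
def conv2d_same(image, kernel):
    """'Same'-size 2D cross-correlation with zero padding, on lists of lists."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(kh):
                for dx in range(kw):
                    yy, xx = y + dy - ph, x + dx - pw
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[dy][dx]
            out[y][x] = acc
    return out


def block_conv2d(image, kernel, block):
    """Convolve each block-by-block tile independently and stitch the results."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            tile = [row[bx:bx + block] for row in image[by:by + block]]
            result = conv2d_same(tile, kernel)
            for y, row in enumerate(result):
                out[by + y][bx:bx + len(row)] = row
    return out


ones = [[1.0] * 4 for _ in range(4)]
box = [[1.0] * 3 for _ in range(3)]
print(conv2d_same(ones, box)[1][1])      # 9.0: interior pixel sees all 9 neighbours
print(block_conv2d(ones, box, 2)[1][1])  # 4.0: pixel only sees its own 2x2 tile
```

The two printed values differ because pixels near a tile boundary no longer see their neighbours in adjacent tiles, so a blocked network computes a different function from the original one and its accuracy has to be evaluated separately.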

IP4-9 EXAMINING THE CONSEQUENCES OF HIGH-LEVEL SYNTHESIS OPTIMIZATIONS ON THE POWER SIDE CHANNEL
Speaker:
Lu Zhang, Northwestern Polytechnical University, CN
Authors:
Lu Zhang1, Wei Hu2, Armaiti Ardeshiricham2, Yu Tai1, Jeremy Blackstone2, Dejun Mu1 and Ryan Kastner2
1Northwestern Polytechnical University, CN; 2University of California, San Diego, US
Abstract
High-level synthesis (HLS) allows hardware designers to think algorithmically and not have to worry about low-level, cycle-by-cycle details. This provides the ability to quickly explore the architectural design space and trade off resource utilization against performance. Unfortunately, evaluating security is not a standard part of the HLS design flow. In this work, we aim to understand the effects of HLS optimizations with respect to power side-channel leakage. We use Vivado HLS to develop different cryptographic cores, implement them on a Xilinx Spartan 6 FPGA, and collect power traces. We evaluate the designs with respect to resource utilization, performance, and side-channel leakage through power consumption. Furthermore, we analyze the first-order leakage of the HLS-based designs alongside well-known register transfer level (RTL) cryptographic cores. We describe an evaluation procedure for hardware designers and use it to make recommendations on how to design the best architecture in the cryptographic domain.

IP4-10 DFARPA: DIFFERENTIAL FAULT ATTACK RESISTANT PHYSICAL DESIGN AUTOMATION
Speaker:
Debdeep Mukhopadhyay, Indian Institute of Technology Kharagpur, IN
Authors:
Mustafa Khairallah1, Rajat Sadhukhan2, Radhamanjari Samanta2, Jakub Breier1, Shivam Bhasin3, Rajat Subhra Chakraborty2, Anupam Chattopadhyay1 and Debdeep Mukhopadhyay2
1Nanyang Technological University, SG; 2Indian Institute of Technology Kharagpur, IN; 3Temasek Laboratories, Nanyang Technological University, SG
Abstract
Differential Fault Analysis (DFA), aided by sophisticated mathematical analysis techniques for ciphers and precise fault injection methodologies, has become a potent threat to cryptographic implementations. In this paper, we propose, to the best of our knowledge, the first "DFA-aware" physical design automation methodology that effectively mitigates the threat posed by DFA. We first develop a novel floorplan heuristic, which resists the simultaneous corruption of cipher states necessary for a successful fault attack, by exploiting the fact that most fault injections are localized in practice. Our technique raises the computational complexity of the fault attack to exhaustive search levels, making such attacks practically infeasible. In the second part of the work, we develop a routing mechanism, which tackles more precise and costly fault injection techniques, like laser and electromagnetic guns. We propose a routing technique that integrates a specially designed ring oscillator based sensor circuit around the potential fault attack targets without incurring any performance overhead. We demonstrate the effectiveness of our technique by applying it to state-of-the-art ciphers.

IP4-11 AN ENERGY-EFFICIENT STOCHASTIC COMPUTATIONAL DEEP BELIEF NETWORK
Speaker:
Yidong Liu, University of Alberta, CA
Authors:
Yidong Liu1, Yanzhi Wang2, Fabrizio Lombardi3 and Jie Han1
1University of Alberta, CA; 2Syracuse University, US; 3Northeastern University, US
Abstract
Deep neural networks (DNNs) are effective machine learning models for solving a large class of recognition problems, including the classification of nonlinearly separable patterns. The applications of DNNs are, however, limited by the large size and high energy consumption of the networks. Recently, stochastic computation (SC) has been considered to implement DNNs to reduce the hardware cost. However, it requires a large number of random number generators (RNGs) that lower the energy efficiency of the network. To overcome these limitations, we propose the design of an energy-efficient deep belief network (DBN) based on stochastic computation. An approximate SC activation unit (A-SCAU) is designed to implement different types of activation functions in the neurons. The A-SCAU is immune to signal correlations, so the RNGs can be shared among all neurons in the same layer with no accuracy loss. The area and energy of the proposed design are 5.27% and 3.31% (or 26.55% and 29.89%) of a 32-bit floating-point (or an 8-bit fixed-point) implementation. It is shown that the proposed SC-DBN design achieves a higher classification accuracy compared to the fixed-point implementation. The accuracy is only 0.12% lower than that of the floating-point design at a similar computation speed, but with a significantly lower energy consumption.
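For readers unfamiliar with stochastic computation, the sketch below shows its basic idea in the unipolar encoding: a value in [0, 1] becomes a random bitstream whose fraction of ones equals the value, and multiplying two independent streams reduces to a bitwise AND. It illustrates the general principle only, not the A-SCAU design; the stream length, the seed and the function names are assumptions.

```python
import random


def to_stream(p, length, rng):
    """Encode p in [0, 1] as a bitstream whose bits are 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def from_stream(bits):
    """Decode a unipolar bitstream back to a value: the fraction of ones."""
    return sum(bits) / len(bits)


def sc_multiply(p, q, length=4096, seed=0):
    """Approximate p * q by ANDing two independently generated bitstreams."""
    rng = random.Random(seed)
    a = to_stream(p, length, rng)
    b = to_stream(q, length, rng)
    return from_stream([x & y for x, y in zip(a, b)])


# Approximates 0.5 * 0.4 = 0.2; the error shrinks as the stream gets longer.
print(sc_multiply(0.5, 0.4, length=20000))
```

The AND trick only works when the two streams are uncorrelated, which is why conventional SC designs need many RNGs and why a correlation-immune unit, as the abstract describes, allows them to be shared.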

IP4-12 PUSHING THE NUMBER OF QUBITS BELOW THE "MINIMUM": REALIZING COMPACT BOOLEAN COMPONENTS FOR QUANTUM LOGIC
Speaker:
Alwin Zulehner, Johannes Kepler University Linz, AT
Authors:
Alwin Zulehner and Robert Wille, Johannes Kepler University Linz, AT
Abstract
Research on quantum computers has gained attention since they are able to solve certain tasks significantly faster than classical machines (in some cases, exponential speed-ups are possible). Since quantum computations typically contain large Boolean components, design automation techniques are required to realize the respective Boolean functions in quantum logic. They usually introduce a significant number of additional qubits - a highly limited resource. In this work, we propose an alternative method for the realization of Boolean components for quantum logic. In contrast to the current state of the art, we specifically address the main reasons for the additionally required qubits (namely the number of occurrences of the most frequent output pattern as well as the number of primary outputs of the function to be realized) and propose to manipulate the function so that both issues are addressed. The resulting methods make it possible to push the number of required qubits below what is currently considered the minimum.
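The two quantities named in the abstract appear directly in the bound that is commonly cited for embedding an irreversible Boolean function into a reversible one: if the most frequent output pattern occurs mu times, at least ceil(log2(mu)) extra outputs are needed to keep the mapping one-to-one, giving max(n, m + ceil(log2(mu))) lines for n inputs and m outputs. The sketch below computes that conventional figure, which is the "minimum" the paper sets out to undercut; the function name and truth-table format are assumptions.

```python
from collections import Counter


def conventional_min_qubits(truth_table, n_inputs, n_outputs):
    """Commonly cited line count for a reversible embedding of a Boolean function.

    truth_table lists the output pattern (a tuple of bits) for each of the
    2**n_inputs input assignments (hypothetical format).
    """
    mu = max(Counter(truth_table).values())  # occurrences of the most frequent output
    extra_outputs = (mu - 1).bit_length()    # integer form of ceil(log2(mu))
    return max(n_inputs, n_outputs + extra_outputs)


# 2-input AND: output 0 occurs three times, so two extra outputs are needed.
and_gate = [(0,), (0,), (0,), (1,)]
print(conventional_min_qubits(and_gate, 2, 1))  # 3

# Half adder (sum, carry): pattern (1, 0) occurs twice.
half_adder = [(0, 0), (1, 0), (1, 0), (0, 1)]
print(conventional_min_qubits(half_adder, 2, 2))  # 3
```

Both terms of the bound, mu and the number of primary outputs, are the levers the abstract proposes to manipulate.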

IP4-13 POWER OPTIMIZATION THROUGH PERIPHERAL CIRCUIT REUSING INTEGRATED WITH LOOP TILING FOR RRAM CROSSBAR-BASED CNN
Authors:
Yuanhui Ni, Weiwen Chen and Keni Qiu, Capital Normal University, CN
Abstract
Convolutional neural networks (CNNs) are widely adopted to make predictions on large amounts of data in modern embedded systems. Prior studies have shown that convolutional computations, which consist of many multiply-and-accumulate (MAC) operations, are the most computationally expensive portion of a CNN. Compared to executing MAC operations on GPUs and FPGAs, CNN implementation in the RRAM crossbar-based computing system (RCS) offers the advantages of high performance and low power. However, in current designs energy is unbalanced among three parts: RRAM crossbar computation, peripheral circuits and memory accesses. The latter two can significantly limit the potential gains of RCS. To address the high power overhead of peripheral circuits in RCS, this paper adopts the Peripheral Circuit Unit (PeriCU)-Reuse scheme to meet a certain power budget. The underlying idea is to put the expensive AD/DAs in the spotlight and arrange multiple convolution layers to be sequentially served by the same PeriCU. Furthermore, it is observed that memory accesses can be bypassed if two adjacent layers are assigned to different PeriCUs. A loop tiling technique is then proposed to further improve the energy and throughput of RCS. Experiments on two convolutional applications show that the PeriCU-Reuse scheme integrated with the loop tiling technique can efficiently meet the power requirement and further reduce energy consumption by 61.7%.
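Loop tiling in general means restructuring a loop nest so that it works on one small block of data at a time, which improves data reuse between memory accesses. The sketch below shows the transformation on a plain matrix multiply made of MAC operations; it is a generic illustration and not the paper's RCS-specific tiling, and the function name and tile size are assumptions.

```python
def matmul_tiled(a, b, tile=2):
    """Matrix multiply with all three loops tiled into tile-sized blocks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    # Outer loops walk over blocks; inner loops stay inside one block.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]  # one MAC operation
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(a, b))  # [[58, 64], [139, 154]]
```

The result is identical to the untiled loop nest; only the order in which MACs are issued changes, which is what lets a tile's operands stay in fast local storage while they are reused.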

IP4-14 ORIENT: ORGANIZED INTERLEAVED ECCS FOR NEW STT-RAM CACHES
Speaker:
Hamed Farbeh, School of Computer Science, Institute for Research in Fundamental Sciences (IPM), IR
Authors:
Zahra Azad1, Hamed Farbeh2 and Amir Mahdi Hosseini Monazzah3
1Sharif University of Technology, IR; 2School of Computer Science, Institute for Research in Fundamental Sciences (IPM), IR; 3Department of Computer Engineering, Sharif University of Technology, Tehran, Iran, IR
Abstract
Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM) is a promising alternative to SRAM in cache memories. However, STT-MRAMs face a high probability of write errors due to their stochastic switching behavior. To correct the write errors, Error-Correcting Codes (ECCs) used in SRAM caches are conventionally employed. A cache line consists of several codewords and the data bits are selected in such a way that the maximum correction capability is provided based on the error patterns in SRAMs. However, the different write error patterns in STT-MRAM caches lead to inefficiency of conventional ECC configurations. In this paper, we first investigate the efficiency of ECC configurations and demonstrate that the vulnerability of codewords in a cache line varies by up to 17x. This variation means that, while some words are overprotected, others are highly likely to experience uncorrectable errors. Then, we propose an ECC bit selection scheme, called ORIENT, to reduce the vulnerability variation of codewords to 1.4x. The simulation results show that the conventional ECC configuration increases the write error rate by up to about 64.4% compared with the optimum ECC bit selection, whereas this value for ORIENT is only 4.5%.

IP4-15 ERASMUS: EFFICIENT REMOTE ATTESTATION VIA SELF-MEASUREMENT FOR UNATTENDED SETTINGS
Speaker:
Norrathep Rattanavipanon, University of California, Irvine, TH
Authors:
Xavier Carpent1, Norrathep Rattanavipanon2 and Gene Tsudik2
1UC Irvine, US; 2UCI, US
Abstract
Remote attestation (RA) is a popular means of detecting malware in embedded and IoT devices. RA is usually realized as a protocol via which a trusted verifier measures software integrity of an untrusted remote device called the prover. All prior RA techniques require on-demand operation. We identify two drawbacks of this approach in the context of unattended devices: First, it fails to detect mobile malware that enters and leaves the prover between successive RA instances. Second, it requires the prover to engage in a potentially expensive computation, which can negatively impact safety-critical or real-time devices. To address these drawbacks, we introduce the concept of self-measurement whereby a prover periodically (and securely) measures and records its own software state. A verifier then collects and verifies these measurements. We demonstrate a concrete technique called ERASMUS, justify its features and evaluate its performance. We show that ERASMUS is well-suited for safety-critical applications. We also define a new metric -- Quality of Attestation (QoA).
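The self-measurement idea can be sketched in a few lines: the prover periodically hashes its software state, authenticates the record with a key shared with the verifier, and keeps the most recent records in a rolling buffer that the verifier later collects. This is a minimal illustration of the concept only, not the ERASMUS protocol; the record layout, the `Prover` class, the `verify` function and the buffer size are assumptions, and the secure storage of the key and measurement code on a real device is out of scope.

```python
import hashlib
import hmac
from collections import deque


class Prover:
    """Device that records authenticated measurements of its own memory."""

    def __init__(self, key, slots=4):
        self._key = key
        self.log = deque(maxlen=slots)  # rolling buffer of recent measurements

    def self_measure(self, t, memory):
        digest = hashlib.sha256(memory).digest()
        message = t.to_bytes(8, "big") + digest
        tag = hmac.new(self._key, message, hashlib.sha256).digest()
        self.log.append((t, digest, tag))


def verify(key, log, expected_digest):
    """Return (timestamp, healthy) for every collected measurement."""
    results = []
    for t, digest, tag in log:
        message = t.to_bytes(8, "big") + digest
        expected_tag = hmac.new(key, message, hashlib.sha256).digest()
        authentic = hmac.compare_digest(tag, expected_tag)
        results.append((t, authentic and digest == expected_digest))
    return results


key = b"k" * 32
good = b"firmware v1"
prover = Prover(key)
prover.self_measure(1, good)
prover.self_measure(2, b"firmware v1 + malware")  # infection between collections
prover.self_measure(3, good)                      # malware has left again
print(verify(key, prover.log, hashlib.sha256(good).digest()))
```

The middle record exposes an infection that came and went between two collections, which is the mobile-malware case that on-demand attestation misses.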

IP4-16 END-TO-END LATENCY ANALYSIS OF CAUSE-EFFECT CHAINS IN AN ENGINE MANAGEMENT SYSTEM
Speaker:
Junchul Choi, Seoul National University, KR
Authors:
Junchul Choi, Donghyun Kang and Soonhoi Ha, Seoul National University, KR
Abstract
An engine management system consists of periodic or sporadic real-time tasks. A task is a set of runnables and may be fully preemptive or preemptive only at runnable boundaries. A cause-effect chain is defined as a chain of runnables that are connected by read/write dependencies. We propose a novel analytical technique to estimate the end-to-end latency of a cause-effect chain by considering conservatively estimated schedule time bounds of the associated runnables. The proposed approach is verified with an industrial-strength automotive benchmark.

IP4-17 GENERAL FLOORPLANNING METHODOLOGY FOR 3D ICS WITH AN ARBITRARY BONDING STYLE
Speaker:
Chien-Yu Huang, Department of Electrical Engineering, National Cheng Kung University, TW
Authors:
Jai-Ming Lin and Chien-Yu Huang, Department of Electrical Engineering, National Cheng Kung University, TW
Abstract
This paper proposes a general floorplanning methodology which can be applied to 3D ICs with an arbitrary bonding style. Some studies have shown that a 3D IC with the hybrid bonding style, which includes face-to-back and face-to-face bonding, may obtain better results than one using only the face-to-back bonding style. We present an approach to assign modules to tiers for each kind of bonding style. Further, a new utilization function, called the cosine-shaped function, is proposed to estimate utilizations of bins required by the analytical-based approach. Our experimental results show that the cosine-shaped function obtains slightly better results than the bell-shaped function on IBM benchmarks for 2D floorplanning. We also show that the proposed 3D floorplanning methodology uses fewer TSVs and yields shorter wirelength than previous work in the hybrid bonding style.
