IP4 Interactive Presentations

Date: Thursday 30 March 2017
Time: 10:00 - 10:30
Location / Room: IP sessions (in front of rooms 4A and 5A)

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated with each IP paper is on display throughout the morning. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon Interactive Presentations session, the 'Best IP of the Day' award is given.

Label Presentation Title
Authors
IP4-1 1024-CHANNEL 3D ULTRASOUND DIGITAL BEAMFORMER IN A SINGLE 5W FPGA
Speaker:
Aya Ibrahim, EPFL, CH
Authors:
Federico Angiolini1, Aya Ibrahim1, William Simon1, Ahmet Caner Yüzügüler1, Marcel Arditi1, Jean-Philippe Thiran1 and Giovanni De Micheli2
1EPFL, CH; 2École Polytechnique Fédérale de Lausanne (EPFL), CH
Abstract
3D ultrasound, an emerging medical imaging technique that is presently only used in hospitals, has the potential to enable breakthrough telemedicine applications, provided that its cost and power dissipation can be minimized. In this paper, we present an FPGA architecture suitable for a portable medical 3D ultrasound device. We show an optimized design for the digital part of the imager, including the delay calculation block, which is its most critical part. Our computationally efficient approach requires a single FPGA for 3D imaging, which is unprecedented. The design is scalable; a configuration supporting a 32×32-channel probe, which enables high-quality imaging, consumes only about 5W.
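
To illustrate why delay calculation dominates the digital workload, the following is a minimal sketch of the textbook two-way time-of-flight computation used in delay-and-sum beamforming. It is not the optimized architecture from the paper. The speed of sound, the element pitch, the transmit origin at the probe center, and the helper name `two_way_delay` are all assumptions made for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 1540.0  # assumed soft-tissue value, not taken from the paper


def two_way_delay(focal_point, element_xy, origin=(0.0, 0.0, 0.0)):
    """Transmit-plus-receive time of flight in seconds (hypothetical helper)."""
    ex, ey = element_xy
    transmit_path = math.dist(origin, focal_point)        # origin -> focal point
    receive_path = math.dist((ex, ey, 0.0), focal_point)  # focal point -> element
    return (transmit_path + receive_path) / SPEED_OF_SOUND_M_S


PITCH_M = 0.3e-3          # assumed element spacing
focal = (0.0, 0.0, 0.03)  # one focal point, 30 mm deep on the probe axis

# One delay per element of a 32x32 probe, centered on the origin.
delays = [
    two_way_delay(focal, ((col - 15.5) * PITCH_M, (row - 15.5) * PITCH_M))
    for row in range(32)
    for col in range(32)
]
```

A single focal point already needs 1024 delay values, and a 3D volume contains a very large number of focal points, which suggests why this block is singled out as critical.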

Download Paper (PDF; Only available from the DATE venue WiFi)
IP4-2 LAANT: A LIBRARY TO AUTOMATICALLY OPTIMIZE EDP FOR OPENMP APPLICATIONS
Speaker:
Arthur Francisco Lorenzon, Federal University of Rio Grande do Sul, BR
Authors:
Arthur Lorenzon, Jeckson Dellagostin Souza and Antonio Carlos Schneider Beck Filho, Universidade Federal do Rio Grande do Sul, BR
Abstract
Efficiently exploiting thread-level parallelism from new multicore systems has been challenging for software developers. While blindly increasing the number of threads may lead to performance gains, it can also result in a disproportionate increase in energy consumption. For this reason, choosing the right number of threads is essential to reach the best compromise between both. However, such a task is extremely difficult: besides the huge number of variables involved, many of them will change according to different aspects of the system at hand and can only be determined at run-time. To address this complex scenario, we propose LAANT, a novel library to automatically find the optimal number of threads for OpenMP applications, by dynamically considering their particular characteristics, input set, and the processor architecture. By executing nine well-known benchmarks on three real multicore processors, LAANT improves the EDP (Energy-Delay Product) by up to 61%, compared to the standard OpenMP execution; and by 44%, when the dynamic adjustment of the number of threads of OpenMP is activated.
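
The metric being optimized can be made concrete with a small sketch. EDP is the product of energy and execution time, so the best thread count is the one that minimizes that product rather than time alone. The code below is an offline illustration only, not LAANT's run-time search. The function names and the sample measurements are invented for this example.

```python
def edp(energy_joules, time_seconds):
    """Energy-Delay Product: lower is better."""
    return energy_joules * time_seconds


def best_thread_count(measurements):
    """Pick the thread count with the lowest EDP.

    measurements maps a thread count to an (energy in J, time in s) pair.
    """
    return min(measurements, key=lambda n: edp(*measurements[n]))


# Made-up numbers: 8 threads is fastest, but its energy cost is disproportionate.
samples = {
    1: (100.0, 10.0),
    2: (110.0, 5.5),
    4: (140.0, 3.5),
    8: (220.0, 3.0),
}
best = best_thread_count(samples)
```

With these illustrative numbers the lowest EDP falls at 4 threads even though 8 threads gives the shortest run time.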

IP4-3 IMPROVING THE ACCURACY OF THE LEAKAGE POWER ESTIMATION OF EMBEDDED CPUS
Speaker:
Shiao-Li Tsao, National Chiao Tung University, TW
Authors:
Ting-Wu Chin, Shiao-Li Tsao, Kuo-Wei Hung and Pei-Shu Huang, National Chiao Tung University, TW
Abstract
Previous studies have used on-chip thermal sensors (diodes) to estimate the leakage power of a CPU. However, an embedded CPU is equipped with only a few thermal sensors and may suffer from considerable spatial temperature variances across the CPU core, and leakage power estimation based on insufficient temperature information introduces errors. According to our experiments, the conventional leakage power models may have up to 22.9% estimation error for a 70-nm embedded CPU. In this study, we first evaluated the accuracy of leakage power estimates based on thermal sensors on different locations of a CPU and suggested locations that can reduce the error to 0.9%. Then, we proposed temperature-referred and counter-tracked estimation (TRACE) that relies on temperature sensors and hardware activity counters to estimate leakage power. The simulation results demonstrated that employing TRACE could reduce the error to 3.4%. Experiments were also conducted on a real platform to verify our findings.

IP4-4 SCHEDULE-AWARE LOOP PARALLELIZATION FOR EMBEDDED MPSOCS BY EXPLOITING PARALLEL SLACK
Speaker:
Miguel Angel Aguilar, RWTH Aachen University, DE
Authors:
Miguel Angel Aguilar1, Rainer Leupers1, Gerd Ascheid1, Nikolaos Kavvadias2 and Liam Fitzpatrick2
1RWTH Aachen University, DE; 2Silexica Software Solutions GmbH, DE
Abstract
MPSoC programming is still a challenging task, where several aspects have to be taken into account to achieve a profitable parallel execution. Selecting a proper scheduling policy is an aspect that has a major impact on the performance. OpenMP is an example of a programming paradigm that allows the scheduling policy to be specified on a per-loop basis. However, choosing the best scheduling policy and the corresponding parameters is not a trivial task. In fact, there is already a large amount of software parallelized with OpenMP, where the scheduling policy is not explicitly specified. Then, the scheduling decision is left to the default runtime, which in most of the cases does not yield the best performance. In this paper, we present a schedule-aware optimization approach enabled by exploiting the parallel slack existing in loops parallelized with OpenMP. Results on an embedded multicore device show that OpenMP loops optimized with our approach outperform the original OpenMP loops, where the scheduling policy is not specified, by up to 93%.
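
The effect of the scheduling policy can be seen in a toy model. The sketch below compares a block-wise static distribution with a dynamic one in which each idle thread takes the next chunk of iterations. It is a simplified simulation that ignores scheduling overhead, not the optimization approach of the paper, and the function names and iteration costs are assumptions.

```python
import heapq


def static_makespan(costs, threads):
    """Split iterations into one contiguous block per thread."""
    block = -(-len(costs) // threads)  # ceiling division
    return max(sum(costs[i:i + block]) for i in range(0, len(costs), block))


def dynamic_makespan(costs, threads, chunk=1):
    """Hand the next chunk to whichever thread becomes idle first."""
    free_at = [0] * threads  # time at which each thread is idle again
    heapq.heapify(free_at)
    for i in range(0, len(costs), chunk):
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + sum(costs[i:i + chunk]))
    return max(free_at)


# Imbalanced loop: iteration i costs i time units (illustrative values).
costs = list(range(1, 9))
static_time = static_makespan(costs, threads=4)
dynamic_time = dynamic_makespan(costs, threads=4)
```

In this model the static split finishes at time 15 because one thread receives the two most expensive iterations, while the dynamic policy finishes at time 12. On real hardware the dynamic policy also pays a dispatch cost per chunk, which is why the choice of policy and chunk size is not trivial.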

IP4-5 REDUCING CODE MANAGEMENT OVERHEAD IN SOFTWARE-MANAGED MULTICORES
Speaker:
Aviral Shrivastava, Arizona State University, US
Authors:
Jian Cai1, Yooseong Kim1, Youngbin Kim2, Aviral Shrivastava1 and Kyoungwoo Lee2
1Arizona State University, US; 2Yonsei University, KR
Abstract
Software-managed architectures, which use scratchpad memories (SPMs), are a promising alternative to cache-based architectures for multicores. SPMs provide scalability but require explicit management. For example, to use an instruction SPM, explicit management code needs to be inserted around every call site to load functions to the SPM. Such management code would check the state of the SPM and perform loading operations if necessary, which can cause considerable overhead at runtime. In this paper, we propose a compiler-based approach to reduce this overhead by identifying management code that can be removed or simplified. Our experiments with various benchmarks show that our approach reduces the execution time by 14% on average. In addition, compared to hardware caching, using our approach on an SPM-based architecture can reduce the execution times of the benchmarks by up to 15%.

IP4-6 PERFORMANCE EVALUATION AND OPTIMIZATION OF HBM-ENABLED GPU FOR DATA-INTENSIVE APPLICATIONS
Speaker:
Yuan Xie, University of California, Santa Barbara, US
Authors:
Maohua Zhu1, Youwei Zhuo2, Chao Wang3, Wenguang Chen4 and Yuan Xie1
1University of California, Santa Barbara, US; 2University of Southern California, US; 3University of Science and Technology of China, CN; 4Tsinghua University, CN
Abstract
Graphics Processing Units (GPUs) are widely used to accelerate data-intensive applications. To improve the performance of data-intensive applications, higher GPU memory bandwidth is desirable. Traditional GDDR memories achieve higher bandwidth by increasing frequency, which leads to excessive power consumption. Recently, a new memory technology called high-bandwidth memory (HBM) based on 3D die-stacking technology has been used in the latest generation of GPUs, which can provide both high bandwidth and low power consumption with in-package stacked DRAM memory. However, the capacity of integrated in-package stacked memory is limited (e.g. only 4GB for the state-of-the-art HBM-enabled GPU, AMD Radeon Fury X). In this paper, we implement two representative data-intensive applications, convolutional neural network (CNN) and breadth-first search (BFS), on an HBM-enabled GPU to evaluate the improvement brought by the adoption of the HBM, and investigate techniques to fully unleash the benefits of such HBM-enabled GPU. Based on the evaluation results, we first propose a software pipeline to alleviate the capacity limitation of the HBM for CNN. We then design two programming techniques to improve the utilization of memory bandwidth for BFS application. Experiment results demonstrate that our pipelined CNN training achieves a 1.63x speedup on an HBM-enabled GPU compared with the best high-performance GPU on the market, and the two optimization techniques for the BFS algorithm make it at most 24.5x (9.8x and 2.5x for each technique, respectively) faster than conventional implementations.

IP4-7 DAC: DEDUP-ASSISTED COMPRESSION SCHEME FOR IMPROVING LIFETIME OF NAND STORAGE SYSTEMS
Speaker:
Jisung Park, Seoul National University, KR
Authors:
Jisung Park1, Sungjin Lee2 and Jihong Kim1
1Seoul National University, KR; 2Inha University, KR
Abstract
Thanks to an aggressive scaling of semiconductor devices, the capacity of NAND flash-based solid-state-drives (SSDs) has increased greatly. However, this benefit comes at the expense of a serious degradation of NAND device lifetime. In order to improve the lifetime of flash-based SSDs, various data reduction techniques, such as deduplication, lossless compression, and delta compression, are being rapidly adopted in SSDs. Although each technique has been extensively studied, how to efficiently combine these techniques for maximizing their synergy effects has not been well investigated. In this paper, we propose a novel dedup-assisted compression (DAC) scheme that integrates existing data reduction techniques so that potential benefits of individual ones can be maximized while overcoming their inherent limitations. By doing so, DAC greatly reduces the amount of write traffic sent to SSDs. DAC also requires negligible hardware resources by utilizing existing hardware modules. Our evaluation results show that the proposed DAC decreases the amount of written data by up to 30% over a simple integration of deduplication and lossless compression.
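
For orientation, the baseline that DAC is compared against, a simple integration of deduplication and lossless compression, might look like the sketch below. This is an assumed illustration of that baseline, not the DAC scheme itself, whose details are not given in the abstract. The function name, the use of SHA-256 fingerprints, and the use of zlib are choices made for this example.

```python
import hashlib
import zlib


def bytes_written(chunks):
    """Count bytes sent to storage after dedup followed by compression."""
    seen = set()  # fingerprints of chunks already stored
    written = 0
    for chunk in chunks:
        fingerprint = hashlib.sha256(chunk).digest()
        if fingerprint in seen:
            continue  # duplicate chunk: nothing new is written
        seen.add(fingerprint)
        written += len(zlib.compress(chunk))  # unique chunk: store compressed
    return written


# Three 4 KiB chunks, two of them identical (illustrative data).
chunks = [b"a" * 4096, b"a" * 4096, b"b" * 4096]
total = bytes_written(chunks)
```

Here the second chunk is dropped by deduplication and the two unique chunks are compressed independently, so delta compression between similar but non-identical chunks, one of the techniques named in the abstract, is left unexploited.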

IP4-8 LIFETIME ADAPTIVE ECC IN NAND FLASH PAGE MANAGEMENT
Speaker:
Shunzhuo Wang, Huazhong University of Science and Technology, CN
Authors:
Shunzhuo Wang1, Fei Wu1, Zhonghai Lu2, You Zhou1, Qin Xiong1, Meng Zhang1 and Changsheng Xie1
1Huazhong University of Science and Technology, CN; 2KTH Royal Institute of Technology, SE
Abstract
With increasing density, NAND flash memory has decreasing reliability. Furthermore, raw bit error rate (RBER) of flash memory grows at an exponential rate as program/erase (P/E) cycle increases. Thus, error correction codes (ECCs), usually stored in the out-of-band area (OOB) of flash pages, are widely employed to ensure the reliability. However, the worst-case oriented ECC is largely under-utilized in the early stage, i.e. when P/E cycles are small, and the required ECC redundancy may be too large to be stored in the OOB. In this paper, we propose LAE-FTL, which employs a lifetime-adaptive ECC scheme, to improve the performance and lifetime of NAND flash memory. In the early stage, weak ECCs can guarantee the reliability and the OOB is large enough to store the ECCs. Thus, LAE-FTL employs weak ECCs and adaptively uses small and incremental codewords as P/E cycle increases to improve data transfer and decoding parallelism. In the late stage with large P/E cycles, strong ECCs are needed and the ECC redundancies become too large to fit in the OOB. Thus, LAE-FTL stores the exceeding ECC redundancies in the data space of flash pages and stores user data in a cross-page fashion. Finally, our evaluation results of trace-driven simulations show that LAE-FTL improves the read performance by up to 63.42%, compared to the worst-case oriented ECC scheme in the early stage, and significantly improves the reliability of flash memory at low data accessing overhead in the late stage.

IP4-9 3D-DPE: A 3D HIGH-BANDWIDTH DOT-PRODUCT ENGINE FOR HIGH-PERFORMANCE NEUROMORPHIC COMPUTING
Speaker:
Miguel Lastras-Montaño, University of California, Santa Barbara, US
Authors:
Miguel Angel Lastras-Montaño1, Bhaswar Chakrabarti1, Dmitri B. Strukov1 and Kwang-Ting Cheng2
1UC Santa Barbara, US; 2HKUST, HK
Abstract
We present and experimentally validate 3D-DPE, a general-purpose dot-product engine, which is ideal for accelerating artificial neural networks (ANNs). 3D-DPE is based on a monolithically integrated 3D CMOS-memristor hybrid circuit and performs a high-dimensional dot-product operation (a recurrent and computationally expensive operation in ANNs) within a single step, using analog current-based computing. 3D-DPE is made up of two subsystems, namely a CMOS subsystem serving as the memory controller and an analog memory subsystem consisting of multiple layers of high-density memristive crossbar arrays fabricated on top of the CMOS subsystem. Their integration is based on a high-density area-distributed interface, resulting in much higher connectivity between the two subsystems, compared to the traditional interface of a 2D system or a 3D system integrated using through-silicon vias. As a result, 3D-DPE's single-step dot-product operation is not limited by the memory bandwidth, and the input dimension of the operations scales well with the capacity of the 3D memristive arrays. To demonstrate the feasibility of 3D-DPE, we designed and fabricated a CMOS memory controller and monolithically integrated 2 layers of titanium-oxide memristive crossbars. Then we performed the analog dot-product operation under different input conditions in two scenarios: (1) with devices within the same crossbar layer and (2) with devices from different layers. In both cases, the devices exhibited low voltage operation and analog switching behavior with high tuning accuracy.
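
The principle behind a current-based crossbar dot product can be written down in a few lines. With input voltages applied to the rows and each memristor acting as a conductance, Ohm's law and Kirchhoff's current law give a column current equal to the dot product of the voltage vector with that column's conductances. The sketch below models an ideal crossbar only, ignoring wire resistance and device non-idealities. The function name and the numeric values are assumptions and do not come from the paper.

```python
def crossbar_currents(conductances, voltages):
    """Ideal crossbar read-out.

    conductances[i][j] is the conductance in siemens between row i and column j.
    voltages[i] is the voltage in volts applied to row i.
    Returns one summed current in amperes per column.
    """
    n_cols = len(conductances[0])
    return [
        sum(conductances[i][j] * voltages[i] for i in range(len(voltages)))
        for j in range(n_cols)
    ]


# Illustrative 2x2 array with conductances in the microsiemens range.
g = [[1e-6, 2e-6],
     [3e-6, 4e-6]]
v = [0.1, 0.2]
currents = crossbar_currents(g, v)
```

All column currents form at the same time in the physical array, which is the sense in which the dot product is obtained in a single step.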

IP4-10 A SCHEDULABILITY TEST FOR SOFTWARE MIGRATION ON MULTICORE SYSTEMS
Speaker:
Jung-Eun Kim, Department of Computer Science at the University of Illinois at Urbana-Champaign, US
Authors:
Jung-Eun Kim1, Richard Bradford2, Tarek Abdelzaher3 and Lui Sha3
1Department of Computer Science, University of Illinois at Urbana-Champaign, US; 2Rockwell Collins, Cedar Rapids, IA, US; 3University of Illinois, US
Abstract
This paper presents a new schedulability test for safety-critical software undergoing a transition from single-core to multicore systems - a challenge faced by multiple industries today. Our migration model consists of a schedulability test and execution model. Its properties enable us to obtain a utilization bound that places an allowable limit on total task execution times. Evaluation results demonstrate the advantages of our scheduling model over competing resource partitioning approaches, such as Periodic Server and TDMA.
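
To show what a utilization-bound schedulability test looks like in general, the sketch below implements the classical Liu and Layland bound for rate-monotonic scheduling on a single core. It is a different, well-known bound used here purely as an example of the idea. The bound derived in the paper for its migration model is not stated in the abstract and is not reproduced here. The function names and task sets are assumptions.

```python
def liu_layland_bound(n_tasks):
    """Classical single-core rate-monotonic utilization bound."""
    return n_tasks * (2 ** (1 / n_tasks) - 1)


def passes_utilization_test(tasks):
    """tasks is a list of (execution_time, period) pairs in the same time unit.

    The test is sufficient but not necessary: a task set that fails it
    may still be schedulable.
    """
    utilization = sum(c / t for c, t in tasks)
    return utilization <= liu_layland_bound(len(tasks))


light = [(1, 4), (1, 5), (2, 10)]  # total utilization 0.65
heavy = [(2, 4), (2, 5), (2, 10)]  # total utilization 1.10
light_ok = passes_utilization_test(light)
heavy_ok = passes_utilization_test(heavy)
```

The appeal of such a test is that it places a limit on total execution time that can be checked with simple arithmetic, without simulating the schedule.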

IP4-11 ADAPTIVE POWER DELIVERY SYSTEM MANAGEMENT FOR MANY-CORE PROCESSORS WITH ON/OFF-CHIP VOLTAGE REGULATORS
Speaker:
Haoran Li, The Hong Kong University of Science and Technology, HK
Authors:
Haoran Li, Jiang Xu, Zhe Wang, Peng Yang, Rafael Kioji Vivas Maeda and Zhongyuan Tian, The Hong Kong University of Science and Technology, HK
Abstract
The power delivery system (PDS) plays a crucial role in guaranteeing the proper functionality of many-core processors. However, as PDS is usually optimized to provide power to the target chip at its best performance level, its energy efficiency can be seriously degraded under highly dynamic workloads, making it a major source of system power losses. On-chip voltage regulators (VRs), which are able to achieve fast and fine-grained power control, have been popular choices for PDS implementation and provided design opportunities for improving system energy efficiency. In this paper, we propose the adaptive Quantized Power Management (QPM) scheme to dynamically adjust the PDS with both on-chip and off-chip VRs based on run-time workloads. Experimental results on different applications show that QPM applied on a hybrid PDS with both on/off-chip VRs achieves 74.1% average overall energy efficiency, 12.3% higher than the conventional PDS with single off-chip VR.

IP4-12 FLYING AND DECOUPLING CAPACITANCE OPTIMIZATION FOR AREA-CONSTRAINED ON-CHIP SWITCHED-CAPACITOR VOLTAGE REGULATORS
Speaker:
Xiaoyang Mi, Arizona State University, US
Authors:
Xiaoyang Mi1, Hesam Fathi Moghadam2 and Jae-sun Seo1
1Arizona State University, US; 2Oracle Corporation, US
Abstract
Switched-capacitor voltage regulators (SCVRs) are widely used in on-chip power management, due to high step-down efficiency and feasibility of integration. In this work, we present theoretical analysis and optimization methodology for flying and decoupling capacitance values for area-constrained on-chip SCVRs to achieve the highest system-level power efficiency. The proposed models for efficiency and droop voltage are validated with on-chip 2:1 SCVR implementations in both 65nm and 32nm CMOS, which show high model accuracy. The maximum and average error of the predicted optimal ratio between flying and decoupling capacitance are 5% and 1.7%, respectively.

IP4-13 ENHANCING ANALOG YIELD OPTIMIZATION FOR VARIATION-AWARE CIRCUITS SIZING
Speaker:
Ons Lahiouel, Concordia University, CA
Authors:
Ons Lahiouel, Mohamed H. Zaki and Sofiene Tahar, Concordia University, CA
Abstract
This paper presents a novel approach for improving automated analog yield optimization using a two-step exploration strategy. First, a global optimization phase relies on a modified Lipschitzian optimization to sample the potential optimal sub-regions of the feasible design space. The search locates a design point near the optimal solution that is used as a starting point by a local optimization phase. The local search constructs linear interpolating surrogate models of the yield to explore the basin of convergence and to rapidly reach the global optimum. Experimental results show that our approach locates higher quality design points in terms of yield rate in less run time and without affecting the accuracy.

IP4-14 A NEW SAMPLING TECHNIQUE FOR MONTE CARLO-BASED STATISTICAL CIRCUIT ANALYSIS
Speaker:
Hiwa Mahmoudi, Vienna University of Technology, AT
Authors:
Hiwa Mahmoudi and Horst Zimmermann, Vienna University of Technology, AT
Abstract
Variability is a fundamental issue which gets exponentially worse as CMOS technology shrinks. Therefore, characterization of statistical variations has become an important part of the design phase. Monte Carlo-based simulation is a standard technique for statistical analysis and modeling of integrated circuits. However, crude Monte Carlo sampling based on pseudorandom selection of parameter variations suffers from low convergence rates and thus, providing high accuracy is computationally expensive. In this work, we present an extensive study on the performance of two widely used techniques, Latin Hypercube and Low Discrepancy sampling methods, and compare their speed-up and accuracy performance properties. It is shown that these methods can exhibit a better efficiency as compared to the pseudorandom sampling but only in limited applications. Therefore, we propose a new sampling scheme that exploits the benefits of both methods by combining them. Through representative circuit examples, it is shown that the proposed sampling technique provides a major improvement in terms of computational effort and offers better properties as compared to either method alone.
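
One of the two existing techniques studied, Latin Hypercube sampling, can be sketched briefly. Each parameter range is cut into as many equal strata as there are samples, one point is drawn inside every stratum, and the strata are shuffled independently per dimension. The code below is a plain textbook version on the unit hypercube and not the combined scheme proposed in the paper. The function name and the sample sizes are assumptions.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims, one per stratum per dimension."""
    columns = []
    for _ in range(n_dims):
        # One uniformly placed point inside each of the n_samples strata.
        points = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(points)  # decouple the strata across dimensions
        columns.append(points)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


rng = random.Random(0)  # fixed seed so the example is reproducible
samples = latin_hypercube(n_samples=8, n_dims=3, rng=rng)
```

Unlike plain pseudorandom sampling, every stratum of every parameter is covered exactly once, which is the property that tends to reduce variance for a given number of circuit simulations.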

IP4-15 AUTOMATIC TECHNOLOGY MIGRATION OF ANALOG IC DESIGNS USING GENERIC CELL LIBRARIES
Speaker:
Nuno Horta, Instituto de Telecomunicações / Instituto Superior Técnico, PT
Authors:
Jose Cachaco1, Nuno Machado1, Nuno Lourenco1, Jorge Guilherme2 and Nuno Horta3
1Instituto de Telecomunicações/Instituto Superior Técnico, PT; 2Instituto de Telecomunicações/Instituto Politécnico de Tomar, PT; 3Instituto de Telecomunicações/Instituto Superior Técnico, PT
Abstract
This paper addresses the problem of automatic technology migration of analog IC designs. The proposed approach introduces a new level of abstraction, for EDA tools addressing analog IC design, allowing a systematic and effortless adaptation of a design to a new technology. The new abstraction level is based on generic cell libraries, which include topology and testbench descriptions for specific circuit classes. The new approach is implemented and tested using a state-of-the-art multi-objective multi-constraint circuit-level optimization tool, and is validated for the sizing and optimization of continuous-time comparators, including technology migration between two different design nodes, respectively, XFAB 350 nm technology (XH035) and ATMEL 150 nm SOI technology (AT77K).

IP4-16 NOISE-SENSITIVE FEEDBACK LOOP IDENTIFICATION IN LINEAR TIME-VARYING ANALOG CIRCUITS
Speaker:
Peng Li, Texas A&M University, US
Authors:
Ang Li1, Peng Li1, Tingwen Huang2 and Edgar Sánchez-Sinencio1
1Texas A&M University, US; 2Texas A&M University at Qatar, QA
Abstract
The continuing scaling of VLSI technology and design complexity has rendered robustness of analog circuits a significant concern. Parasitic effects may introduce unexpected marginal instability within multiple noise-sensitive loops and hence jeopardize circuit operation and processing precision. The Loop Finder algorithm has been recently proposed to allow detection of noise-sensitive return loops for circuits that are described using a linear time-invariant (LTI) system model. However, many practical circuits such as switched-capacitor filters and mixers present time-varying behaviors which are intrinsically coupled with noise propagation and introduce new noise generation mechanisms. For the first time, we take an in-depth look into the marginal instability of linear periodically time-varying (LPTV) analog circuits and further develop an algorithm for efficient identification of noise-sensitive loops, unifying the solution to noise sensitivity analysis for both LTI and LPTV circuits.

IP4-17 CANDY-TM: COMPARATIVE ANALYSIS OF DYNAMIC THERMAL MANAGEMENT IN MANY-CORES USING MODEL CHECKING
Speaker:
Muhammad Shafique, Institute of Computer Engineering, Vienna University of Technology (TU Wien), AT
Authors:
Syed Ali Asadullah Bukhari1, Faiq Khalid Lodhi2, Osman Hasan2, Muhammad Shafique3 and Joerg Henkel4
1National University of Sciences and Technology - School of Electrical Engineering and Computer Science, PK; 2School of Electrical Engineering and Computer Science National University of Sciences and Technology (NUST), PK; 3Vienna University of Technology (TU Wien), AT; 4Karlsruhe Institute of Technology, DE
Abstract
Dynamic thermal management (DTM) techniques based on task migration provide a promising solution to mitigate thermal emergencies, thereby ensuring safe operation and reliability of many-core systems. These techniques can be classified as central or distributed on the basis of a central DTM controller for the whole system or individual DTM controllers for each core or set of cores in the system, respectively. However, having a trustworthy comparison between central (c-) and distributed (d-) DTM techniques to find out the most suitable one for a given system is quite challenging. This is primarily due to the systemic difference between cDTM and dDTM controllers, and the inherent non-exhaustiveness of simulation and emulation methods conventionally used for DTM analysis. In this paper, we present a novel methodology called CAnDy-TM (stands for Comparative Analysis of Dynamic Thermal Management) that employs Model Checking to perform formal comparative analysis for cDTM and dDTM techniques. We identify a set of generic functional and performance properties to provide a common ground for their comparison. We demonstrate the usability and benefits of our methodology by comparing state-of-the-art cDTM and dDTM techniques, and illustrate which technique is good w.r.t. thermal stability and other task migration parameters. Such an analysis helps in selecting the most appropriate DTM for a given chip.

IP4-18 POWER PRE-CHARACTERIZED MESHING ALGORITHM FOR FINITE ELEMENT THERMAL ANALYSIS OF INTEGRATED CIRCUITS
Speaker:
Shohdy Abdelkader, Software Developer, EG
Authors:
Shohdy Abdelkader1, Alaa ElRouby2 and Mohamed Dessouky1
1Mentor, EG; 2Electric and Electronic Department, Faculty of Engineering and Natural Science, Yildirim Beyazit University, TR
Abstract
In this paper we present an adaptive meshing technique suitable for steady-state finite element (FE) based thermal analysis of integrated circuits (ICs). The algorithm presented is a non-iterative one where the technology used is first pre-characterized. The characterization results are then used for scanning the layout to detect high power regions and then fine-meshing them. Finally, the analysis is done only once. This makes it faster than conventional iterative adaptive meshing methods. The algorithm results showed comparable accuracy and better performance when compared to the flux-based (iterative) and the power-aware (non-iterative) algorithms.
