IP4 Interactive Presentations

Label	Presentation Title Authors
IP4-1	1024-CHANNEL 3D ULTRASOUND DIGITAL BEAMFORMER IN A SINGLE 5W FPGA Speaker: Aya Ibrahim, EPFL, CH Authors: Federico Angiolini¹, Aya Ibrahim¹, William Simon¹, Ahmet Caner Yüzügüler¹, Marcel Arditi¹, Jean-Philippe Thiran¹ and Giovanni De Micheli² ¹EPFL, CH; ²École Polytechnique Fédérale de Lausanne (EPFL), CH Abstract 3D ultrasound, an emerging medical imaging technique that is presently only used in hospitals, has the potential to enable breakthrough telemedicine applications, provided that its cost and power dissipation can be minimized. In this paper, we present an FPGA architecture suitable for a portable medical 3D ultrasound device. We show an optimized design for the digital part of the imager, including the delay calculation block, which is its most critical part. Our computationally efficient approach requires a single FPGA for 3D imaging, which is unprecedented. The design is scalable; a configuration supporting a 32×32-channel probe, which enables high-quality imaging, consumes only about 5W. Download Paper (PDF; Only available from the DATE venue WiFi)
IP4-2	LAANT: A LIBRARY TO AUTOMATICALLY OPTIMIZE EDP FOR OPENMP APPLICATIONS Speaker: Arthur Francisco Lorenzon, Federal University of Rio Grande do Sul, BR Authors: Arthur Lorenzon, Jeckson Dellagostin Souza and Antonio Carlos Schneider Beck Filho, Universidade Federal do Rio Grande do Sul, BR Abstract Efficiently exploiting thread-level parallelism from new multicore systems has been challenging for software developers. While blindly increasing the number of threads may lead to performance gains, it can also result in a disproportionate increase in energy consumption. For this reason, choosing the right number of threads is essential to reach the best compromise between both. However, this task is extremely difficult: besides the huge number of variables involved, many of them will change according to different aspects of the system at hand and can only be determined at run-time. To address this complex scenario, we propose LAANT, a novel library to automatically find the optimal number of threads for OpenMP applications, by dynamically considering their particular characteristics, input set, and the processor architecture. By executing nine well-known benchmarks on three real multicore processors, LAANT improves the EDP (Energy-Delay Product) by up to 61%, compared to the standard OpenMP execution; and by 44%, when the dynamic adjustment of the number of threads of OpenMP is activated.
IP4-3	IMPROVING THE ACCURACY OF THE LEAKAGE POWER ESTIMATION OF EMBEDDED CPUS Speaker: Shiao-Li Tsao, National Chiao Tung University, TW Authors: Ting-Wu Chin, Shiao-Li Tsao, Kuo-Wei Hung and Pei-Shu Huang, National Chiao Tung University, TW Abstract Previous studies have used on-chip thermal sensors (diodes) to estimate the leakage power of a CPU. However, an embedded CPU is equipped with only a few thermal sensors and may suffer from considerable spatial temperature variances across the CPU core, and leakage power estimation based on insufficient temperature information introduces errors. According to our experiments, the conventional leakage power models may have up to 22.9% estimation error for a 70-nm embedded CPU. In this study, we first evaluated the accuracy of leakage power estimates based on thermal sensors at different locations of a CPU and suggested locations that can reduce the error to 0.9%. Then, we proposed temperature-referred and counter-tracked estimation (TRACE) that relies on temperature sensors and hardware activity counters to estimate leakage power. The simulation results demonstrated that employing TRACE could reduce the error to 3.4%. Experiments were also conducted on a real platform to verify our findings.
IP4-4	SCHEDULE-AWARE LOOP PARALLELIZATION FOR EMBEDDED MPSOCS BY EXPLOITING PARALLEL SLACK Speaker: Miguel Angel Aguilar, RWTH Aachen University, DE Authors: Miguel Angel Aguilar¹, Rainer Leupers¹, Gerd Ascheid¹, Nikolaos Kavvadias² and Liam Fitzpatrick² ¹RWTH Aachen University, DE; ²Silexica Software Solutions GmbH, DE Abstract MPSoC programming is still a challenging task, where several aspects have to be taken into account to achieve a profitable parallel execution. Selecting a proper scheduling policy is an aspect that has a major impact on the performance. OpenMP is an example of a programming paradigm that allows the scheduling policy to be specified on a per-loop basis. However, choosing the best scheduling policy and the corresponding parameters is not a trivial task. In fact, there is already a large amount of software parallelized with OpenMP, where the scheduling policy is not explicitly specified. Then, the scheduling decision is left to the default runtime, which in most cases does not yield the best performance. In this paper, we present a schedule-aware optimization approach enabled by exploiting the parallel slack existing in loops parallelized with OpenMP. Results on an embedded multicore device show that OpenMP loops optimized with our approach outperform the original OpenMP loops, where the scheduling policy is not specified, by up to 93%.
IP4-5	REDUCING CODE MANAGEMENT OVERHEAD IN SOFTWARE-MANAGED MULTICORES Speaker: Aviral Shrivastava, Arizona State University, US Authors: Jian Cai¹, Yooseong Kim¹, Youngbin Kim², Aviral Shrivastava¹ and Kyoungwoo Lee² ¹Arizona State University, US; ²Yonsei University, KR Abstract Software-managed architectures, which use scratchpad memories (SPMs), are a promising alternative to cache-based architectures for multicores. SPMs provide scalability but require explicit management. For example, to use an instruction SPM, explicit management code needs to be inserted around every call site to load functions to the SPM. Such management code would check the state of the SPM and perform loading operations if necessary, which can cause considerable overhead at runtime. In this paper, we propose a compiler-based approach to reduce this overhead by identifying management code that can be removed or simplified. Our experiments with various benchmarks show that our approach reduces the execution time by 14% on average. In addition, compared to hardware caching, using our approach on an SPM-based architecture can reduce the execution times of the benchmarks by up to 15%.
IP4-6	PERFORMANCE EVALUATION AND OPTIMIZATION OF HBM-ENABLED GPU FOR DATA-INTENSIVE APPLICATIONS Speaker: Yuan Xie, University of California, Santa Barbara, US Authors: Maohua Zhu¹, Youwei Zhuo², Chao Wang³, Wenguang Chen⁴ and Yuan Xie¹ ¹University of California, Santa Barbara, US; ²University of Southern California, US; ³University of Science and Technology of China, CN; ⁴Tsinghua University, CN Abstract Graphics Processing Units (GPUs) are widely used to accelerate data-intensive applications. To improve the performance of data-intensive applications, higher GPU memory bandwidth is desirable. Traditional GDDR memories achieve higher bandwidth by increasing frequency, which leads to excessive power consumption. Recently, a new memory technology called high-bandwidth memory (HBM) based on 3D die-stacking technology has been used in the latest generation of GPUs, which can provide both high bandwidth and low power consumption with in-package stacked DRAM memory. However, the capacity of integrated in-package stacked memory is limited (e.g. only 4GB for the state-of-the-art HBM-enabled GPU, AMD Radeon Fury X). In this paper, we implement two representative data-intensive applications, convolutional neural network (CNN) and breadth-first search (BFS), on an HBM-enabled GPU to evaluate the improvement brought by the adoption of the HBM, and investigate techniques to fully unleash the benefits of such an HBM-enabled GPU. Based on the evaluation results, we first propose a software pipeline to alleviate the capacity limitation of the HBM for CNN. We then design two programming techniques to improve the utilization of memory bandwidth for the BFS application. Experimental results demonstrate that our pipelined CNN training achieves a 1.63x speedup on an HBM-enabled GPU compared with the best high-performance GPU on the market, and the two optimization techniques for the BFS algorithm make it at most 24.5x (9.8x and 2.5x for each technique, respectively) faster than conventional implementations.
IP4-7	DAC: DEDUP-ASSISTED COMPRESSION SCHEME FOR IMPROVING LIFETIME OF NAND STORAGE SYSTEMS Speaker: Jisung Park, Seoul National University, KR Authors: Jisung Park¹, Sungjin Lee² and Jihong Kim¹ ¹Seoul National University, KR; ²Inha University, KR Abstract Thanks to an aggressive scaling of semiconductor devices, the capacity of NAND flash-based solid-state drives (SSDs) has increased greatly. However, this benefit comes at the expense of a serious degradation of NAND device lifetime. In order to improve the lifetime of flash-based SSDs, various data reduction techniques, such as deduplication, lossless compression, and delta compression, are rapidly being adopted in SSDs. Although each technique has been extensively studied, how to efficiently combine these techniques to maximize their synergy effects has not been well investigated. In this paper, we propose a novel dedup-assisted compression (DAC) scheme that integrates existing data reduction techniques so that potential benefits of individual ones can be maximized while overcoming their inherent limitations. By doing so, DAC greatly reduces the amount of write traffic sent to SSDs. DAC also requires negligible hardware resources by utilizing existing hardware modules. Our evaluation results show that the proposed DAC decreases the amount of written data by up to 30% over a simple integration of deduplication and lossless compression.
IP4-8	LIFETIME ADAPTIVE ECC IN NAND FLASH PAGE MANAGEMENT Speaker: Shunzhuo Wang, Huazhong University of Science and Technology, CN Authors: Shunzhuo Wang¹, Fei Wu¹, Zhonghai Lu², You Zhou¹, Qin Xiong¹, Meng Zhang¹ and Changsheng Xie¹ ¹Huazhong University of Science and Technology, CN; ²KTH Royal Institute of Technology, SE Abstract With increasing density, NAND flash memory has decreasing reliability. Furthermore, the raw bit error rate (RBER) of flash memory grows at an exponential rate as the program/erase (P/E) cycle count increases. Thus, error correction codes (ECCs), usually stored in the out-of-band area (OOB) of flash pages, are widely employed to ensure the reliability. However, the worst-case oriented ECC is largely under-utilized in the early stage, i.e. when P/E cycles are small, and the required ECC redundancy may be too large to be stored in the OOB. In this paper, we propose LAE-FTL, which employs a lifetime-adaptive ECC scheme, to improve the performance and lifetime of NAND flash memory. In the early stage, weak ECCs can guarantee the reliability and the OOB is large enough to store the ECCs. Thus, LAE-FTL employs weak ECCs and adaptively uses small and incremental codewords as the P/E cycle count increases to improve data transfer and decoding parallelism. In the late stage with large P/E cycles, strong ECCs are needed and the ECC redundancies become too large to fit in the OOB. Thus, LAE-FTL stores the exceeding ECC redundancies in the data space of flash pages and stores user data in a cross-page fashion. Finally, our evaluation results of trace-driven simulations show that LAE-FTL improves the read performance by up to 63.42%, compared to the worst-case oriented ECC scheme in the early stage, and significantly improves reliability of flash memory at low data accessing overhead in the late stage.
IP4-9	3D-DPE: A 3D HIGH-BANDWIDTH DOT-PRODUCT ENGINE FOR HIGH-PERFORMANCE NEUROMORPHIC COMPUTING Speaker: Miguel Lastras-Montaño, University of California, Santa Barbara, US Authors: Miguel Angel Lastras-Montaño¹, Bhaswar Chakrabarti¹, Dmitri B. Strukov¹ and Kwang-Ting Cheng² ¹UC Santa Barbara, US; ²HKUST, HK Abstract We present and experimentally validate 3D-DPE, a general-purpose dot-product engine, which is ideal for accelerating artificial neural networks (ANNs). 3D-DPE is based on a monolithically integrated 3D CMOS-memristor hybrid circuit and performs a high-dimensional dot-product operation (a recurrent and computationally expensive operation in ANNs) within a single step, using analog current-based computing. 3D-DPE is made up of two subsystems, namely a CMOS subsystem serving as the memory controller and an analog memory subsystem consisting of multiple layers of high-density memristive crossbar arrays fabricated on top of the CMOS subsystem. Their integration is based on a high-density area-distributed interface, resulting in much higher connectivity between the two subsystems, compared to the traditional interface of a 2D system or a 3D system integrated using through-silicon vias. As a result, 3D-DPE's single-step dot-product operation is not limited by the memory bandwidth, and the input dimension of the operations scales well with the capacity of the 3D memristive arrays. To demonstrate the feasibility of 3D-DPE, we designed and fabricated a CMOS memory controller and monolithically integrated two layers of titanium-oxide memristive crossbars. Then we performed the analog dot-product operation under different input conditions in two scenarios: (1) with devices within the same crossbar layer and (2) with devices from different layers. In both cases, the devices exhibited low voltage operation and analog switching behavior with high tuning accuracy.
IP4-10	A SCHEDULABILITY TEST FOR SOFTWARE MIGRATION ON MULTICORE SYSTEMS Speaker: Jung-Eun Kim, Department of Computer Science at the University of Illinois at Urbana-Champaign, US Authors: Jung-Eun Kim¹, Richard Bradford², Tarek Abdelzaher³ and Lui Sha³ ¹Department of Computer Science, University of Illinois at Urbana-Champaign, US; ²Rockwell Collins, Cedar Rapids, IA, US; ³University of Illinois, US Abstract This paper presents a new schedulability test for safety-critical software undergoing a transition from single-core to multicore systems - a challenge faced by multiple industries today. Our migration model consists of a schedulability test and execution model. Its properties enable us to obtain a utilization bound that places an allowable limit on total task execution times. Evaluation results demonstrate the advantages of our scheduling model over competing resource partitioning approaches, such as Periodic Server and TDMA.
IP4-11	ADAPTIVE POWER DELIVERY SYSTEM MANAGEMENT FOR MANY-CORE PROCESSORS WITH ON/OFF-CHIP VOLTAGE REGULATORS Speaker: Haoran Li, The Hong Kong University of Science and Technology, HK Authors: Haoran Li, Jiang Xu, Zhe Wang, Peng Yang, Rafael Kioji Vivas Maeda and Zhongyuan Tian, The Hong Kong University of Science and Technology, HK Abstract The power delivery system (PDS) plays a crucial role in guaranteeing the proper functionality of many-core processors. However, as the PDS is usually optimized to provide power to the target chip at its best performance level, its energy efficiency can be seriously degraded under highly dynamic workloads, making it a major source of system power losses. On-chip voltage regulators (VRs), which are able to achieve fast and fine-grained power control, have been popular choices for PDS implementation and provided design opportunities for improving system energy efficiency. In this paper, we propose the adaptive Quantized Power Management (QPM) scheme to dynamically adjust the PDS with both on-chip and off-chip VRs based on run-time workloads. Experimental results on different applications show that QPM applied on a hybrid PDS with both on-chip and off-chip VRs achieves 74.1% average overall energy efficiency, 12.3% higher than the conventional PDS with a single off-chip VR.
IP4-12	FLYING AND DECOUPLING CAPACITANCE OPTIMIZATION FOR AREA-CONSTRAINED ON-CHIP SWITCHED-CAPACITOR VOLTAGE REGULATORS Speaker: Xiaoyang Mi, Arizona State University, US Authors: Xiaoyang Mi¹, Hesam Fathi Moghadam² and Jae-sun Seo¹ ¹Arizona State University, US; ²Oracle Corporation, US Abstract Switched-capacitor voltage regulators (SCVRs) are widely used in on-chip power management, due to high step-down efficiency and feasibility of integration. In this work, we present a theoretical analysis and an optimization methodology for flying and decoupling capacitance values for area-constrained on-chip SCVRs to achieve the highest system-level power efficiency. The proposed models for efficiency and droop voltage are validated with on-chip 2:1 SCVR implementations in both 65nm and 32nm CMOS, which show high model accuracy. The maximum and average errors of the predicted optimal ratio between flying and decoupling capacitance are 5% and 1.7%, respectively.
IP4-13	ENHANCING ANALOG YIELD OPTIMIZATION FOR VARIATION-AWARE CIRCUITS SIZING Speaker: Ons Lahiouel, Concordia University, CA Authors: Ons Lahiouel, Mohamed H. Zaki and Sofiene Tahar, Concordia University, CA Abstract This paper presents a novel approach for improving automated analog yield optimization using a two-step exploration strategy. First, a global optimization phase relies on a modified Lipschitzian optimization to sample the potential optimal sub-regions of the feasible design space. The search locates a design point near the optimal solution that is used as a starting point by a local optimization phase. The local search constructs linear interpolating surrogate models of the yield to explore the basin of convergence and to rapidly reach the global optimum. Experimental results show that our approach locates higher quality design points in terms of yield rate in less run time and without affecting the accuracy.
IP4-14	A NEW SAMPLING TECHNIQUE FOR MONTE CARLO-BASED STATISTICAL CIRCUIT ANALYSIS Speaker: Hiwa Mahmoudi, Vienna University of Technology, AT Authors: Hiwa Mahmoudi and Horst Zimmermann, Vienna University of Technology, AT Abstract Variability is a fundamental issue which gets exponentially worse as CMOS technology shrinks. Therefore, characterization of statistical variations has become an important part of the design phase. Monte Carlo-based simulation is a standard technique for statistical analysis and modeling of integrated circuits. However, crude Monte Carlo sampling based on pseudorandom selection of parameter variations suffers from low convergence rates, and thus providing high accuracy is computationally expensive. In this work, we present an extensive study on the performance of two widely used techniques, Latin Hypercube and Low Discrepancy sampling methods, and compare their speed-up and accuracy performance properties. It is shown that these methods can exhibit better efficiency than pseudorandom sampling, but only in limited applications. Therefore, we propose a new sampling scheme that exploits the benefits of both methods by combining them. Through representative circuit examples, it is shown that the proposed sampling technique provides a major improvement in terms of computational effort and offers better properties than either method alone.
IP4-15	AUTOMATIC TECHNOLOGY MIGRATION OF ANALOG IC DESIGNS USING GENERIC CELL LIBRARIES Speaker: Nuno Horta, Instituto de Telecomunicações / Instituto Superior Técnico, PT Authors: Jose Cachaco¹, Nuno Machado¹, Nuno Lourenco¹, Jorge Guilherme² and Nuno Horta³ ¹Instituto de Telecomunicacoes/Instituto Superior Tecnico, PT; ²Instituto de Telecomunicacoes/Instituto Politecnico de Tomar, PT; ³Instituto de Telecomunicações/Instituto Superior Técnico, PT Abstract This paper addresses the problem of automatic technology migration of analog IC designs. The proposed approach introduces a new level of abstraction for EDA tools addressing analog IC design, allowing a systematic and effortless adaptation of a design to a new technology. The new abstraction level is based on generic cell libraries, which include topology and testbench descriptions for specific circuit classes. The new approach is implemented and tested using a state-of-the-art multi-objective multi-constraint circuit-level optimization tool, and is validated for the sizing and optimization of continuous-time comparators, including technology migration between two different design nodes, respectively, XFAB 350 nm technology (XH035) and ATMEL 150 nm SOI technology (AT77K).
IP4-16	NOISE-SENSITIVE FEEDBACK LOOP IDENTIFICATION IN LINEAR TIME-VARYING ANALOG CIRCUITS Speaker: Peng Li, Texas A&M University, US Authors: Ang Li¹, Peng Li¹, Tingwen Huang² and Edgar Sánchez-Sinencio¹ ¹Texas A&M University, US; ²Texas A&M University at Qatar, QA Abstract The continuing scaling of VLSI technology and design complexity has rendered robustness of analog circuits a significant concern. Parasitic effects may introduce unexpected marginal instability within multiple noise-sensitive loops and hence jeopardize circuit operation and processing precision. The Loop Finder algorithm has been recently proposed to allow detection of noise-sensitive return loops for circuits that are described using a linear time-invariant (LTI) system model. However, many practical circuits such as switched-capacitor filters and mixers present time-varying behaviors which are intrinsically coupled with noise propagation and introduce new noise generation mechanisms. For the first time, we take an in-depth look into the marginal instability of linear periodically time-varying (LPTV) analog circuits and further develop an algorithm for efficient identification of noise-sensitive loops, unifying the solution to noise sensitivity analysis for both LTI and LPTV circuits.
IP4-17	CANDY-TM: COMPARATIVE ANALYSIS OF DYNAMIC THERMAL MANAGEMENT IN MANY-CORES USING MODEL CHECKING Speaker: Muhammad Shafique, Institute of Computer Engineering, Vienna University of Technology (TU Wien), AT Authors: Syed Ali Asadullah Bukhari¹, Faiq Khalid Lodhi², Osman Hasan², Muhammad Shafique³ and Joerg Henkel⁴ ¹National University of Sciences and Technology - School of Electrical Engineering and Computer Science, PK; ²School of Electrical Engineering and Computer Science National University of Sciences and Technology (NUST), PK; ³Vienna University of Technology (TU Wien), AT; ⁴Karlsruhe Institute of Technology, DE Abstract Dynamic thermal management (DTM) techniques based on task migration provide a promising solution to mitigate thermal emergencies and thereby ensure safe operation and reliability of many-core systems. These techniques can be classified as central or distributed on the basis of a central DTM controller for the whole system or individual DTM controllers for each core or set of cores in the system, respectively. However, having a trustworthy comparison between central (c-) and distributed (d-) DTM techniques to find out the most suitable one for a given system is quite challenging. This is primarily due to the systemic difference between cDTM and dDTM controllers, and the inherent non-exhaustiveness of simulation and emulation methods conventionally used for DTM analysis. In this paper, we present a novel methodology called CAnDy-TM (which stands for Comparative Analysis of Dynamic Thermal Management) that employs Model Checking to perform formal comparative analysis for cDTM and dDTM techniques. We identify a set of generic functional and performance properties to provide a common ground for their comparison. We demonstrate the usability and benefits of our methodology by comparing state-of-the-art cDTM and dDTM techniques, and illustrate which technique is good w.r.t. thermal stability and other task migration parameters. Such an analysis helps in selecting the most appropriate DTM for a given chip.
IP4-18	POWER PRE-CHARACTERIZED MESHING ALGORITHM FOR FINITE ELEMENT THERMAL ANALYSIS OF INTEGRATED CIRCUITS Speaker: Shohdy Abdelkader, Software Developer, EG Authors: Shohdy Abdelkader¹, Alaa ElRouby² and Mohamed Dessouky¹ ¹Mentor, EG; ²Electric and Electronic Department, Faculty of Engineering and Natural Science, Yildirim Beyazit University, TR Abstract In this paper we present an adaptive meshing technique suitable for steady-state finite element (FE) based thermal analysis of integrated circuits (ICs). The algorithm presented is a non-iterative one where the technology used is first pre-characterized. The characterization results are then used for scanning the layout to detect high power regions and then fine-meshing them. Finally, the analysis is done only once. This makes it faster than conventional iterative adaptive meshing methods. The algorithm results showed comparable accuracy and better performance when compared to the flux-based (iterative) and the power-aware (non-iterative) algorithms.
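
The IP4-2 abstract above uses the Energy-Delay Product (EDP) as its figure of merit; EDP is the energy consumed multiplied by the execution time, so a lower value is better. The following is a minimal Python sketch of picking a thread count by EDP. It assumes an exhaustive comparison over made-up measurements and is not LAANT's search strategy; the names `measurements`, `edp` and `best_thread_count` are hypothetical.

```python
# Hypothetical measurements: thread count -> (energy in joules, runtime in seconds).
# These numbers are invented for illustration; they are not results from the paper.
measurements = {
    1: (40.0, 10.0),
    2: (44.0, 5.5),
    4: (56.0, 3.2),
    8: (90.0, 2.6),
}


def edp(energy_joules, delay_seconds):
    """Energy-Delay Product: energy multiplied by execution time."""
    return energy_joules * delay_seconds


def best_thread_count(samples):
    """Return the thread count whose measured EDP is lowest."""
    return min(samples, key=lambda n: edp(*samples[n]))


best = best_thread_count(measurements)
print("thread count with lowest EDP:", best)
```

In this invented data set the fastest configuration (8 threads) is not the one with the lowest EDP, which is the trade-off between performance and energy that the abstract describes.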

Label

Presentation Title
Authors

IP4-1

1024-CHANNEL 3D ULTRASOUND DIGITAL BEAMFORMER IN A SINGLE 5W FPGA
Speaker:
Aya Ibrahim, EPFL, CH
Authors:
Federico Angiolini¹, Aya Ibrahim¹, William Simon¹, Ahmet Caner Yüzügüler¹, Marcel Arditi¹, Jean-Philippe Thiran¹ and Giovanni De Micheli²
¹EPFL, CH; ²École Polytechnique Fédérale de Lausanne (EPFL), CH
Abstract
3D ultrasound, an emerging medical imaging tech- nique that is presently only used in hospitals, has the potential to enable breakthrough telemedicine applications, provided that its cost and power dissipation can be minimized. In this paper, we present an FPGA architecture suitable for a portable medical 3D ultrasound device. We show an optimized design for the digital part of the imager, including the delay calculation block, which is its most critical part. Our computationally efficient approach requires a single FPGA for 3D imaging, which is unprecedented. The design is scalable; a configuration supporting a 32×32- channel probe, which enables high-quality imaging, consumes only about 5W.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-2

LAANT: A LIBRARY TO AUTOMATICALLY OPTIMIZE EDP FOR OPENMP APPLICATIONS
Speaker:
Arthur Francisco Lorenzon, Federal University of Rio Grande do Sul, BR
Authors:
Arthur Lorenzon, Jeckson Dellagostin Souza and Antonio Carlos Schneider Beck Filho, Universidade Federal do Rio Grande do Sul, BR
Abstract
Efficiently exploiting thread level parallelism from new multicore systems has been challenging for software developers. While blindly increasing the number of threads may lead to performance gains, it can also result in disproportionate increase in energy consumption. For this reason, rightly choosing the number of threads is essential to reach the best compromise between both. However, such task is extremely difficult: besides the huge number of variables involved, many of them will change according to different aspects of the system at hand and are only possible to be defined at run-time. To address this complex scenario, we propose LAANT, a novel library to automatically find the optimal number of threads for OpenMP applications, by dynamically considering their particular characteristics, input set, and the processor architecture. By executing nine well-known benchmarks on three real multicore processors, LAANT improves the EDP (Energy-Delay Product) by up to 61%, compared to the standard OpenMP execution; and by 44%, when the dynamic adjustment of the number of threads of OpenMP is activated.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-3

IMPROVING THE ACCURACY OF THE LEAKAGE POWER ESTIMATION OF EMBEDDED CPUS
Speaker:
Shiao-Li Tsao, National Chiao Tung University, TW
Authors:
Ting-Wu Chin, Shiao-Li Tsao, Kuo-Wei Hung and Pei-Shu Huang, National Chiao Tung University, TW
Abstract
Previous studies have used on-chip thermal sensors (diodes) to estimate the leakage power of a CPU. However, an embedded CPU equips only a few thermal sensors and may suffer from considerable spatial temperature variances across the CPU core, and leakage power estimation based on insufficient temperature information introduces errors. According to our experiments, the conventional leakage power models may have up to 22.9% estimation error for a 70-nm embedded CPU. In this study, we first evaluated the accuracy of leakage power estimates based on thermal sensors on different locations of a CPU and suggested locations that can reduce the error to 0.9%. Then, we proposed temperature-referred and counter-tracked estimation (TRACE) that relies on temperature sensors and hardware activity counters to estimate leakage power. The simulation results demonstrated that employing TRACE could reduce the error to 3.4%. Experiments were also conducted on a real platform to verify our findings.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-4

SCHEDULE-AWARE LOOP PARALLELIZATION FOR EMBEDDED MPSOCS BY EXPLOITING PARALLEL SLACK
Speaker:
Miguel Angel Aguilar, RWTH Aachen University, DE
Authors:
Miguel Angel Aguilar¹, Rainer Leupers¹, Gerd Ascheid¹, Nikolaos Kavvadias² and Liam Fitzpatrick²
¹RWTH Aachen University, DE; ²Silexica Software Solutions GmbH, DE
Abstract
MPSoC programming is still a challenging task, where several aspects have to be taken into account to achieve a profitable parallel execution. Selecting a proper scheduling policy is an aspect that has a major impact on the performance. OpenMP is an example of a programming paradigm that allows to specify the scheduling policy on a per loop basis. However, choosing the best scheduling policy and the corresponding parameters is not a trivial task. In fact, there is already a large amount of software parallelized with OpenMP, where the scheduling policy is not explicitly specified. Then, the scheduling decision is left to the default runtime, which in most of the cases does not yield the best performance. In this paper, we present a schedule-aware optimization approach enabled by exploiting the parallel slack existing in loops parallelized with OpenMP. Results on an embedded multicore device, show that the performance achieved by OpenMP loops optimized with our approach outperform by up to 93%, the performance achieved by the original OpenMP loops, where the scheduling policy is not specified.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-5

REDUCING CODE MANAGEMENT OVERHEAD IN SOFTWARE-MANAGED MULTICORES
Speaker:
Aviral Shrivastava, Arizona State University, US
Authors:
Jian Cai¹, Yooseong Kim¹, Youngbin Kim², Aviral Shrivastava¹ and Kyoungwoo Lee²
¹Arizona State University, US; ²Yonsei University, KR
Abstract
Software-managed architectures, which use scratch- pad memories (SPMs), are a promising alternative to cached- based architectures for multicores. SPMs provide scalability but require explicit management. For example, to use an instruction SPM, explicit management code needs to be inserted around every call site to load functions to the SPM. such management code would check the state of the SPM and perform loading operations if necessary, which can cause considerable overhead at runtime. In this paper, we propose a compiler-based approach to reduce this overhead by identifying management code that can be removed or simplified. Our experiments with various benchmarks show that our approach reduces the execution time by 14% on average. In addition, compared to hardware caching, using our approach on an SPM-based architecture can reduce the execution times of the benchmarks by up to 15%
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-6

PERFORMANCE EVALUATION AND OPTIMIZATION OF HBM-ENABLED GPU FOR DATA-INTENSIVE APPLICATIONS
Speaker:
Yuan Xie, University of California, Santa Barbara, US
Authors:
Maohua Zhu¹, Youwei Zhuo², Chao Wang³, Wenguang Chen⁴ and Yuan Xie¹
¹University of California, Santa Barbara, US; ²University of Southern California, US; ³University of Science and Technology of China, CN; ⁴Tsinghua University, CN
Abstract
Graphics Processing Units (GPUs) are widely used to accelerate data-intensive applications. To improve the performance of data-intensive applications, higher GPU memory bandwidth is desirable. Traditional GDDR memories achieve higher bandwidth by increasing frequency, which leads to excessive power consumption. Recently, a new memory technology called high-bandwidth memory (HBM), based on 3D die-stacking technology, has been used in the latest generation of GPUs. It can provide both high bandwidth and low power consumption with in-package stacked DRAM memory. However, the capacity of integrated in-package stacked memory is limited (e.g. only 4GB for the state-of-the-art HBM-enabled GPU, AMD Radeon Fury X). In this paper, we implement two representative data-intensive applications, convolutional neural network (CNN) and breadth-first search (BFS), on an HBM-enabled GPU to evaluate the improvement brought by the adoption of HBM, and investigate techniques to fully unleash the benefits of such an HBM-enabled GPU. Based on the evaluation results, we first propose a software pipeline to alleviate the capacity limitation of the HBM for CNN. We then design two programming techniques to improve the utilization of memory bandwidth for the BFS application. Experimental results demonstrate that our pipelined CNN training achieves a 1.63x speedup on an HBM-enabled GPU compared with the best high-performance GPU on the market, and the two optimization techniques for the BFS algorithm make it at most 24.5x (9.8x and 2.5x for each technique, respectively) faster than conventional implementations.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-7

DAC: DEDUP-ASSISTED COMPRESSION SCHEME FOR IMPROVING LIFETIME OF NAND STORAGE SYSTEMS
Speaker:
Jisung Park, Seoul National University, KR
Authors:
Jisung Park¹, Sungjin Lee² and Jihong Kim¹
¹Seoul National University, KR; ²Inha University, KR
Abstract
Thanks to the aggressive scaling of semiconductor devices, the capacity of NAND flash-based solid-state drives (SSDs) has increased greatly. However, this benefit comes at the expense of a serious degradation of NAND device lifetime. In order to improve the lifetime of flash-based SSDs, various data reduction techniques, such as deduplication, lossless compression, and delta compression, are rapidly being adopted in SSDs. Although each technique has been extensively studied, how to efficiently combine these techniques to maximize their synergy has not been well investigated. In this paper, we propose a novel dedup-assisted compression (DAC) scheme that integrates existing data reduction techniques so that the potential benefits of individual ones can be maximized while overcoming their inherent limitations. By doing so, DAC greatly reduces the amount of write traffic sent to SSDs. DAC also requires negligible hardware resources by utilizing existing hardware modules. Our evaluation results show that the proposed DAC decreases the amount of written data by up to 30% over a simple integration of deduplication and lossless compression.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-8

LIFETIME ADAPTIVE ECC IN NAND FLASH PAGE MANAGEMENT
Speaker:
Shunzhuo Wang, Huazhong University of Science and Technology, CN
Authors:
Shunzhuo Wang¹, Fei Wu¹, Zhonghai Lu², You Zhou¹, Qin Xiong¹, Meng Zhang¹ and Changsheng Xie¹
¹Huazhong University of Science and Technology, CN; ²KTH Royal Institute of Technology, SE
Abstract
With increasing density, NAND flash memory has decreasing reliability. Furthermore, the raw bit error rate (RBER) of flash memory grows at an exponential rate as the number of program/erase (P/E) cycles increases. Thus, error correction codes (ECCs), usually stored in the out-of-band (OOB) area of flash pages, are widely employed to ensure reliability. However, the worst-case-oriented ECC is largely under-utilized in the early stage, i.e. when the number of P/E cycles is small, and the required ECC redundancy may be too large to be stored in the OOB. In this paper, we propose LAE-FTL, which employs a lifetime-adaptive ECC scheme to improve the performance and lifetime of NAND flash memory. In the early stage, weak ECCs can guarantee reliability and the OOB is large enough to store the ECCs. Thus, LAE-FTL employs weak ECCs and adaptively uses small and incremental codewords as the number of P/E cycles increases, to improve data transfer and decoding parallelism. In the late stage, with a large number of P/E cycles, strong ECCs are needed and the ECC redundancies become too large to fit in the OOB. Thus, LAE-FTL stores the exceeding ECC redundancies in the data space of flash pages and stores user data in a cross-page fashion. Finally, our evaluation results of trace-driven simulations show that LAE-FTL improves the read performance by up to 63.42% compared to the worst-case-oriented ECC scheme in the early stage, and significantly improves the reliability of flash memory at low data access overhead in the late stage.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-9

3D-DPE: A 3D HIGH-BANDWIDTH DOT-PRODUCT ENGINE FOR HIGH-PERFORMANCE NEUROMORPHIC COMPUTING
Speaker:
Miguel Lastras-Montaño, University of California, Santa Barbara, US
Authors:
Miguel Angel Lastras-Montaño¹, Bhaswar Chakrabarti¹, Dmitri B. Strukov¹ and Kwang-Ting Cheng²
¹UC Santa Barbara, US; ²HKUST, HK
Abstract
We present and experimentally validate 3D-DPE, a general-purpose dot-product engine, which is ideal for accelerating artificial neural networks (ANNs). 3D-DPE is based on a monolithically integrated 3D CMOS-memristor hybrid circuit and performs a high-dimensional dot-product operation (a recurrent and computationally expensive operation in ANNs) within a single step, using analog current-based computing. 3D-DPE is made up of two subsystems, namely a CMOS subsystem serving as the memory controller and an analog memory subsystem consisting of multiple layers of high-density memristive crossbar arrays fabricated on top of the CMOS subsystem. Their integration is based on a high-density area-distributed interface, resulting in much higher connectivity between the two subsystems compared to the traditional interface of a 2D system or a 3D system integrated using through-silicon vias. As a result, 3D-DPE's single-step dot-product operation is not limited by the memory bandwidth, and the input dimension of the operations scales well with the capacity of the 3D memristive arrays. To demonstrate the feasibility of 3D-DPE, we designed and fabricated a CMOS memory controller and monolithically integrated 2 layers of titanium-oxide memristive crossbars. We then performed the analog dot-product operation under different input conditions in two scenarios: (1) with devices within the same crossbar layer and (2) with devices from different layers. In both cases, the devices exhibited low-voltage operation and analog switching behavior with high tuning accuracy.
Download Paper (PDF; Only available from the DATE venue WiFi)
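
As a rough numerical illustration of the single-step analog dot product described in the 3D-DPE abstract: in an idealized crossbar, each column current is the sum of the row voltages weighted by the device conductances (Ohm's law plus Kirchhoff's current law). The sketch below is not taken from the paper; the function name `crossbar_dot_product` and all values are hypothetical, and device non-idealities such as wire resistance are ignored.

```python
def crossbar_dot_product(voltages, conductances):
    """Idealized crossbar read-out: I_j = sum_i V_i * G[i][j].

    voltages:     one input voltage per crossbar row
    conductances: rows x columns matrix of device conductances
    Returns one output current per crossbar column.
    """
    n_rows = len(conductances)
    if n_rows != len(voltages):
        raise ValueError("need exactly one voltage per crossbar row")
    n_cols = len(conductances[0])
    return [
        sum(voltages[i] * conductances[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]


# Hypothetical 3x2 crossbar with arbitrary (unitless) example values.
voltages = [1.0, 2.0, 0.0]
conductances = [
    [1.0, 0.0],
    [2.0, 1.0],
    [3.0, 5.0],
]
currents = crossbar_dot_product(voltages, conductances)
```

In hardware all column currents form simultaneously, which is what the abstract means by a single step; the nested loop here only models the result, not the timing.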

IP4-10

A SCHEDULABILITY TEST FOR SOFTWARE MIGRATION ON MULTICORE SYSTEMS
Speaker:
Jung-Eun Kim, Department of Computer Science at the University of Illinois at Urbana-Champaign, US
Authors:
Jung-Eun Kim¹, Richard Bradford², Tarek Abdelzaher³ and Lui Sha³
¹Department of Computer Science, University of Illinois at Urbana-Champaign, US; ²Rockwell Collins, Cedar Rapids, IA, US; ³University of Illinois, US
Abstract
This paper presents a new schedulability test for safety-critical software undergoing a transition from single-core to multicore systems - a challenge faced by multiple industries today. Our migration model consists of a schedulability test and execution model. Its properties enable us to obtain a utilization bound that places an allowable limit on total task execution times. Evaluation results demonstrate the advantages of our scheduling model over competing resource partitioning approaches, such as Periodic Server and TDMA.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-11

ADAPTIVE POWER DELIVERY SYSTEM MANAGEMENT FOR MANY-CORE PROCESSORS WITH ON/OFF-CHIP VOLTAGE REGULATORS
Speaker:
Haoran Li, The Hong Kong University of Science and Technology, HK
Authors:
Haoran Li, Jiang Xu, Zhe Wang, Peng Yang, Rafael Kioji Vivas Maeda and Zhongyuan Tian, The Hong Kong University of Science and Technology, HK
Abstract
The power delivery system (PDS) plays a crucial role in guaranteeing the proper functionality of many-core processors. However, as the PDS is usually optimized to provide power to the target chip at its best performance level, its energy efficiency can be seriously degraded under highly dynamic workloads, making it a major source of system power losses. On-chip voltage regulators (VRs), which are able to achieve fast and fine-grained power control, have been popular choices for PDS implementation and have provided design opportunities for improving system energy efficiency. In this paper, we propose the adaptive Quantized Power Management (QPM) scheme to dynamically adjust the PDS with both on-chip and off-chip VRs based on run-time workloads. Experimental results on different applications show that QPM applied to a hybrid PDS with both on-chip and off-chip VRs achieves 74.1% average overall energy efficiency, 12.3% higher than the conventional PDS with a single off-chip VR.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-12

FLYING AND DECOUPLING CAPACITANCE OPTIMIZATION FOR AREA-CONSTRAINED ON-CHIP SWITCHED-CAPACITOR VOLTAGE REGULATORS
Speaker:
Xiaoyang Mi, Arizona State University, US
Authors:
Xiaoyang Mi¹, Hesam Fathi Moghadam² and Jae-sun Seo¹
¹Arizona State University, US; ²Oracle Corporation, US
Abstract
Switched-capacitor voltage regulators (SCVRs) are widely used in on-chip power management, due to their high step-down efficiency and feasibility of integration. In this work, we present a theoretical analysis and an optimization methodology for flying and decoupling capacitance values for area-constrained on-chip SCVRs to achieve the highest system-level power efficiency. The proposed models for efficiency and droop voltage are validated with on-chip 2:1 SCVR implementations in both 65nm and 32nm CMOS, which show high model accuracy. The maximum and average errors of the predicted optimal ratio between flying and decoupling capacitance are 5% and 1.7%, respectively.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-13

ENHANCING ANALOG YIELD OPTIMIZATION FOR VARIATION-AWARE CIRCUITS SIZING
Speaker:
Ons Lahiouel, Concordia University, CA
Authors:
Ons Lahiouel, Mohamed H. Zaki and Sofiene Tahar, Concordia University, CA
Abstract
This paper presents a novel approach for improving automated analog yield optimization using a two-step exploration strategy. First, a global optimization phase relies on a modified Lipschitzian optimization to sample the potentially optimal sub-regions of the feasible design space. The search locates a design point near the optimal solution that is used as a starting point by a local optimization phase. The local search constructs linear interpolating surrogate models of the yield to explore the basin of convergence and to rapidly reach the global optimum. Experimental results show that our approach locates higher quality design points in terms of yield rate in less run time and without affecting the accuracy.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-14

A NEW SAMPLING TECHNIQUE FOR MONTE CARLO-BASED STATISTICAL CIRCUIT ANALYSIS
Speaker:
Hiwa Mahmoudi, Vienna University of Technology, AT
Authors:
Hiwa Mahmoudi and Horst Zimmermann, Vienna University of Technology, AT
Abstract
Variability is a fundamental issue which gets exponentially worse as CMOS technology shrinks. Therefore, characterization of statistical variations has become an important part of the design phase. Monte Carlo-based simulation is a standard technique for statistical analysis and modeling of integrated circuits. However, crude Monte Carlo sampling based on pseudorandom selection of parameter variations suffers from low convergence rates, and thus providing high accuracy is computationally expensive. In this work, we present an extensive study on the performance of two widely used techniques, Latin Hypercube and Low Discrepancy sampling methods, and compare their speed-up and accuracy properties. It is shown that these methods can exhibit better efficiency than pseudorandom sampling, but only in limited applications. Therefore, we propose a new sampling scheme that exploits the benefits of both methods by combining them. Through representative circuit examples, it is shown that the proposed sampling technique provides a major improvement in terms of computational effort and offers better properties than either method alone.
Download Paper (PDF; Only available from the DATE venue WiFi)
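
For readers unfamiliar with one of the two baseline techniques named in the abstract above, the following is a minimal sketch of plain Latin Hypercube sampling on the unit hypercube. It is not the combined scheme the paper proposes, whose details are not given here; the function name `latin_hypercube` and the sample sizes are hypothetical choices for illustration.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Draw n_samples points in the unit hypercube.

    Each dimension is split into n_samples equal-width strata, and
    every stratum is hit exactly once per dimension.
    """
    columns = []
    for _ in range(n_dims):
        # One uniform draw inside each stratum [k/n, (k+1)/n).
        column = [(k + rng.random()) / n_samples for k in range(n_samples)]
        # Shuffle so strata are paired at random across dimensions.
        rng.shuffle(column)
        columns.append(column)
    return [tuple(column[i] for column in columns) for i in range(n_samples)]


rng = random.Random(0)  # fixed seed so the example is reproducible
points = latin_hypercube(8, 2, rng)
```

Unlike crude pseudorandom sampling, no stratum is left empty or sampled twice along any single axis, which is the stratification property that variance-reduction arguments for this method rely on.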

IP4-15

AUTOMATIC TECHNOLOGY MIGRATION OF ANALOG IC DESIGNS USING GENERIC CELL LIBRARIES
Speaker:
Nuno Horta, Instituto de Telecomunicações / Instituto Superior Técnico, PT
Authors:
Jose Cachaco¹, Nuno Machado¹, Nuno Lourenco¹, Jorge Guilherme² and Nuno Horta³
¹Instituto de Telecomunicações/Instituto Superior Técnico, PT; ²Instituto de Telecomunicações/Instituto Politécnico de Tomar, PT; ³Instituto de Telecomunicações/Instituto Superior Técnico, PT
Abstract
This paper addresses the problem of automatic technology migration of analog IC designs. The proposed approach introduces a new level of abstraction for EDA tools addressing analog IC design, allowing a systematic and effortless adaptation of a design to a new technology. The new abstraction level is based on generic cell libraries, which include topology and testbench descriptions for specific circuit classes. The new approach is implemented and tested using a state-of-the-art multi-objective multi-constraint circuit-level optimization tool, and is validated for the sizing and optimization of continuous-time comparators, including technology migration between two different design nodes, namely XFAB 350 nm technology (XH035) and ATMEL 150 nm SOI technology (AT77K).
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-16

NOISE-SENSITIVE FEEDBACK LOOP IDENTIFICATION IN LINEAR TIME-VARYING ANALOG CIRCUITS
Speaker:
Peng Li, Texas A&M University, US
Authors:
Ang Li¹, Peng Li¹, Tingwen Huang² and Edgar Sánchez-Sinencio¹
¹Texas A&M University, US; ²Texas A&M University at Qatar, QA
Abstract
The continuing scaling of VLSI technology and design complexity has rendered robustness of analog circuits a significant concern. Parasitic effects may introduce unexpected marginal instability within multiple noise-sensitive loops and hence jeopardize circuit operation and processing precision. The Loop Finder algorithm has been recently proposed to allow detection of noise-sensitive return loops for circuits that are described using a linear time-invariant (LTI) system model. However, many practical circuits such as switched-capacitor filters and mixers present time-varying behaviors which are intrinsically coupled with noise propagation and introduce new noise generation mechanisms. For the first time, we take an in-depth look into the marginal instability of linear periodically time-varying (LPTV) analog circuits and further develop an algorithm for efficient identification of noise-sensitive loops, unifying the solution to noise sensitivity analysis for both LTI and LPTV circuits.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-17

CANDY-TM: COMPARATIVE ANALYSIS OF DYNAMIC THERMAL MANAGEMENT IN MANY-CORES USING MODEL CHECKING
Speaker:
Muhammad Shafique, Institute of Computer Engineering, Vienna University of Technology (TU Wien), AT
Authors:
Syed Ali Asadullah Bukhari¹, Faiq Khalid Lodhi², Osman Hasan², Muhammad Shafique³ and Joerg Henkel⁴
¹National University of Sciences and Technology - School of Electrical Engineering and Computer Science, PK; ²School of Electrical Engineering and Computer Science National University of Sciences and Technology (NUST), PK; ³Vienna University of Technology (TU Wien), AT; ⁴Karlsruhe Institute of Technology, DE
Abstract
Dynamic thermal management (DTM) techniques based on task migration provide a promising solution to mitigate thermal emergencies and thereby ensure safe operation and reliability of many-core systems. These techniques can be classified as central or distributed, depending on whether they use a central DTM controller for the whole system or individual DTM controllers for each core or set of cores in the system, respectively. However, obtaining a trustworthy comparison between central (c-) and distributed (d-) DTM techniques to find the most suitable one for a given system is quite challenging. This is primarily due to the systemic difference between cDTM and dDTM controllers, and the inherent non-exhaustiveness of the simulation and emulation methods conventionally used for DTM analysis. In this paper, we present a novel methodology called CAnDy-TM (Comparative Analysis of Dynamic Thermal Management) that employs model checking to perform a formal comparative analysis of cDTM and dDTM techniques. We identify a set of generic functional and performance properties to provide a common ground for their comparison. We demonstrate the usability and benefits of our methodology by comparing state-of-the-art cDTM and dDTM techniques, and illustrate which technique performs well w.r.t. thermal stability and other task migration parameters. Such an analysis helps in selecting the most appropriate DTM for a given chip.
Download Paper (PDF; Only available from the DATE venue WiFi)

IP4-18

POWER PRE-CHARACTERIZED MESHING ALGORITHM FOR FINITE ELEMENT THERMAL ANALYSIS OF INTEGRATED CIRCUITS
Speaker:
Shohdy Abdelkader, Software Developer, EG
Authors:
Shohdy Abdelkader¹, Alaa ElRouby² and Mohamed Dessouky¹
¹Mentor, EG; ²Electric and Electronic Department, Faculty of Engineering and Natural Science, Yildirim Beyazit University, TR
Abstract
In this paper, we present an adaptive meshing technique suitable for steady-state finite element (FE) based thermal analysis of integrated circuits (ICs). The algorithm presented is non-iterative: the technology used is first pre-characterized, and the characterization results are then used to scan the layout, detect high-power regions, and mesh them finely. Finally, the analysis is done only once. This makes it faster than conventional iterative adaptive meshing methods. The algorithm showed comparable accuracy and better performance when compared to the flux-based (iterative) and the power-aware (non-iterative) algorithms.
Download Paper (PDF; Only available from the DATE venue WiFi)
