IP4 Interactive Presentations


Date: Thursday 12 March 2020
Time: 10:00 - 10:30
Location / Room: Poster Area

Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.

IP4-1 HIT: A HIDDEN INSTRUCTION TROJAN MODEL FOR PROCESSORS
Speaker:
Jiaqi Zhang, Tongji University, CN
Authors:
Jiaqi Zhang1, Ying Zhang1, Huawei Li2 and Jianhui Jiang3
1Tongji University, CN; 2Chinese Academy of Sciences, CN; 3School of Software Engineering, Tongji University, CN
Abstract
This paper explores an intrusion mechanism for microprocessors based on illegal instructions, namely the hidden instruction Trojan (HIT). HIT uses a low-probability sequence of normal instructions as a boot sequence, followed by an illegal instruction that triggers the Trojan. The payload is a hidden interrupt that forces the program counter to a specific address, so the program at that address runs with super privileges. Meanwhile, we use integer programming to minimize the trigger probability of HIT within a given area overhead. The experimental results demonstrate that HIT has an extremely low trigger probability and can survive detection by existing test methods.
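
The trigger-sequence selection is a small integer program. A minimal sketch of that formulation, assuming the `pulp` solver and invented instruction statistics (the paper's actual model and numbers are not reproduced here):

```python
import math
import pulp

# Hypothetical candidate boot-sequence instructions:
# name -> (occurrence probability in normal programs, area cost of detecting it)
candidates = {
    "insn_a": (0.020, 3),
    "insn_b": (0.008, 4),
    "insn_c": (0.050, 1),
    "insn_d": (0.015, 2),
}
area_budget = 7  # given area-overhead constraint

prob = pulp.LpProblem("min_trigger_probability", pulp.LpMinimize)
pick = {n: pulp.LpVariable(n, cat="Binary") for n in candidates}

# The trigger probability is the product of the picked instructions'
# probabilities; minimizing the sum of their logs is equivalent and linear.
prob += pulp.lpSum(pick[n] * math.log(p) for n, (p, _) in candidates.items())
prob += pulp.lpSum(pick[n] * a for n, (_, a) in candidates.items()) <= area_budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([n for n in candidates if pick[n].value() == 1])
```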

IP4-2 BITSTREAM MODIFICATION ATTACK ON SNOW 3G
Speaker:
Michail Moraitis, Royal Institute of Technology KTH, SE
Authors:
Michail Moraitis and Elena Dubrova, Royal Institute of Technology - KTH, SE
Abstract
SNOW 3G is one of the core algorithms for confidentiality and integrity in several 3GPP wireless communication standards, including the new Next Generation (NG) 5G. It is believed to be resistant to classical cryptanalysis. In this paper, we show that SNOW 3G can be broken by a fault attack based on bitstream modification. By changing the content of some look-up tables in the bitstream, we reduce the non-linear state-updating function of SNOW 3G to a linear one. As a result, it becomes possible to recover the key from a known plaintext-ciphertext pair. To the best of our knowledge, this is the first successful bitstream modification attack on SNOW 3G.
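
To see why the linearization is fatal, consider a toy generator whose state update has been reduced to a linear map over GF(2): every keystream bit then becomes a linear equation in the unknown initial state, which Gaussian elimination can solve. A hedged sketch (a toy 16-bit generator, not SNOW 3G itself):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 16
A = rng.integers(0, 2, (n, n))   # linear state-update matrix (post-attack)
c = rng.integers(0, 2, n)        # linear output tap
s0 = rng.integers(0, 2, n)       # secret initial state

# Collect n keystream bits and the corresponding linear equations in s0.
rows, bits, s, M = [], [], s0.copy(), np.eye(n, dtype=int)
for _ in range(n):
    bits.append(int(c @ s % 2))
    rows.append(c @ M % 2)       # keystream bit expressed as a function of s0
    s, M = A @ s % 2, A @ M % 2

# Gaussian elimination over GF(2) on (rows, bits) would recover s0;
# here we just verify that the linear relations hold.
print(all(int(r @ s0 % 2) == b for r, b in zip(rows, bits)))
```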

IP4-3 A MACHINE LEARNING BASED WRITE POLICY FOR SSD CACHE IN CLOUD BLOCK STORAGE
Speaker:
Yu Zhang, Huazhong University of Science & Technology, CN
Authors:
Yu Zhang1, Ke Zhou1, Ping Huang2, Hua Wang1, Jianying Hu3, Yangtao Wang1, Yongguang Ji3 and Bin Cheng3
1Huazhong University of Science & Technology, CN; 2Temple University, US; 3Tencent Technology (Shenzhen) Co., Ltd., CN
Abstract
Nowadays, the SSD cache plays an important role in cloud storage systems. The associated write policy, which enforces an admission control policy for filling data into the cache, has a significant impact on the performance of the cache system and the amount of write traffic to SSD caches. Based on our analysis of a typical cloud block storage system, approximately 47.09% of writes are write-only, i.e., writes to blocks that have not been read during a certain time window. Naively writing the write-only data to the SSD cache introduces a large number of harmful writes without any contribution to cache performance. On the other hand, it is challenging to identify and filter out write-only data in real time, especially in a cloud environment running changing and diverse workloads. In this paper, to alleviate this cache problem, we propose ML-WP, a machine-learning-based write policy, which reduces write traffic to SSDs by avoiding writing write-only data. The main challenge in this approach is to identify write-only data in real time. To realize ML-WP and achieve accurate write-only data identification, we use machine learning methods to classify data into two groups (write-only and normal data). Based on this classification, write-only data is written directly to backend storage without being cached. Experimental results show that, compared with the widely deployed write-back policy, ML-WP decreases write traffic to the SSD cache by 41.52%, while improving the hit ratio by 2.61% and reducing the average read latency by 37.52%.
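
A minimal sketch of the admission-control idea, with an off-the-shelf classifier and invented per-block features standing in for the paper's model (`ssd_cache` and `backend` are assumed objects):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic training data: per-block features observed at write time
# (e.g. recent read count, write count, reuse distance -- illustrative only).
rng = np.random.default_rng(0)
X = rng.random((10_000, 3))
y = (X[:, 0] < 0.1).astype(int)  # 1 = write-only, 0 = normal (toy labeling rule)

clf = RandomForestClassifier(n_estimators=50).fit(X[:8000], y[:8000])

def handle_write(features, block_id, data, ssd_cache, backend):
    """Admission control in the spirit of ML-WP: data predicted write-only
    bypasses the SSD cache and goes straight to backend storage."""
    if clf.predict([features])[0] == 1:
        backend.write(block_id, data)
    else:
        ssd_cache.write(block_id, data)
```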

IP4-4 YOU ONLY SEARCH ONCE: A FAST AUTOMATION FRAMEWORK FOR SINGLE-STAGE DNN/ACCELERATOR CO-DESIGN
Speaker:
Weiwei Chen, Chinese Academy of Sciences, CN
Authors:
Weiwei Chen, Ying Wang, Shuang Yang, Cheng Liu and Lei Zhang, Chinese Academy of Sciences, CN
Abstract
DNN/accelerator co-design has shown great potential in improving QoR and performance. Typical approaches separate the design flow into two stages: (1) designing an application-specific DNN model with high accuracy; (2) building an accelerator tailored to the DNN's specific characteristics. However, this may fail to deliver the highest composite score, which combines the goals of accuracy and other hardware-related constraints (e.g., latency, energy efficiency), when building a specific neural-network-based system. In this work, we present a single-stage automated framework, YOSO, aiming to generate an optimal software-and-hardware solution that flexibly balances the goals of accuracy, power, and QoS. Compared with the two-stage method on the baseline systolic array accelerator and the Cifar10 dataset, we achieve 1.42x~2.29x energy reduction or 1.79x~3.07x latency reduction at the same level of precision, for different user-specified energy and latency optimization constraints, respectively.
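
A sketch of what a single-stage composite objective might look like; the weighting scheme and names are illustrative, not YOSO's actual scoring function:

```python
def composite_score(accuracy, latency_ms, energy_mj,
                    max_latency_ms, max_energy_mj,
                    alpha=1.0, beta=1.0):
    """Reward for one (DNN, accelerator) candidate in a single-stage search."""
    # Hard-reject candidates that violate the user-specified constraints.
    if latency_ms > max_latency_ms or energy_mj > max_energy_mj:
        return float("-inf")
    # Otherwise trade accuracy against the normalized hardware costs.
    return (accuracy
            - alpha * latency_ms / max_latency_ms
            - beta * energy_mj / max_energy_mj)

print(composite_score(0.93, 8.0, 2.0, max_latency_ms=10.0, max_energy_mj=5.0))
```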

IP4-5 WHEN SORTING NETWORK MEETS PARALLEL BITSTREAMS: A FAULT-TOLERANT PARALLEL TERNARY NEURAL NETWORK ACCELERATOR BASED ON STOCHASTIC COMPUTING
Speaker:
Yawen Zhang, Peking University, CN
Authors:
Yawen Zhang1, Sheng Lin2, Runsheng Wang1, Yanzhi Wang2, Yuan Wang1, Weikang Qian3 and Ru Huang1
1Peking University, CN; 2Northeastern University, US; 3Shanghai Jiao Tong University, CN
Abstract
Stochastic computing (SC) has been widely used in neural networks (NNs) due to its low hardware cost and high fault tolerance. Conventionally, SC-based NN accelerators adopt a hybrid stochastic-binary format, using an accumulative parallel counter to convert bitstreams into a binary number. This method, however, sacrifices fault tolerance and incurs a high hardware cost. In order to fully exploit the superior fault tolerance of SC, taking a ternary neural network (TNN) as an example, we propose a parallel SC-based NN accelerator purely using bitstream computation. We apply a bitonic sorting network to simultaneously implement the accumulation and the activation function with parallel bitstreams. The proposed design not only has high fault tolerance, but also achieves at least a 2.8x energy efficiency improvement over its binary computing counterpart.
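
The key observation is that a compare-and-swap on single bits reduces to an (OR, AND) gate pair, so a bitonic network can sort parallel bit lanes; a toy software model (power-of-two input width assumed):

```python
def cas(a, b):
    """Compare-and-swap for single bits: (max, min) = (OR, AND)."""
    return a | b, a & b

def bitonic_merge(x, descending):
    if len(x) == 1:
        return x
    n = len(x) // 2
    for i in range(n):
        hi, lo = cas(x[i], x[i + n])
        x[i], x[i + n] = (hi, lo) if descending else (lo, hi)
    return bitonic_merge(x[:n], descending) + bitonic_merge(x[n:], descending)

def bitonic_sort(x, descending=True):
    if len(x) <= 1:
        return x
    half = len(x) // 2
    left = bitonic_sort(x[:half], True)     # sort halves in opposite orders
    right = bitonic_sort(x[half:], False)   # to form a bitonic sequence
    return bitonic_merge(left + right, descending)

# Lane k of the sorted result is 1 iff at least k+1 input bits were 1, so
# accumulation + threshold activation reduces to reading a fixed output wire.
print(bitonic_sort([1, 0, 1, 1, 0, 1, 0, 0]))  # [1, 1, 1, 1, 0, 0, 0, 0]
```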

IP4-6 WAVEPRO: CLOCK-LESS WAVE-PROPAGATED PIPELINE COMPILER FOR LOW-POWER AND HIGH-THROUGHPUT COMPUTATION
Speaker:
Yehuda Kra, Bar-Ilan University, IL
Authors:
Yehuda Kra, Adam Teman and Tzachi Noy, Bar-Ilan University, IL
Abstract
Clock-less wave-propagated pipelining is a long-known approach to achieving high throughput without the overhead of costly sampling registers. However, due to many design challenges, which have only increased with technology scaling, this approach has never been widely adopted and has generally been limited to small and very specific demonstrations. This paper addresses this barrier by presenting WavePro, a generic and scalable algorithm capable of skew-balancing any combinational logic netlist for the application of wave pipelining. The algorithm was implemented in the WavePro Compiler automation utility, which interfaces with industry delay-extraction and standard timing-analysis tools to produce a sign-off-quality result. The utility is demonstrated on a dot-product accelerator in a 65 nm CMOS technology, using a vendor-provided standard cell library and commercial timing analysis tools. By reducing the worst-case output skew by over 70%, the test-case example was able to achieve the equivalent throughput of an 8-stage sequentially pipelined implementation with power savings of almost 3x.
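
A toy model of the skew-balancing objective (real netlists need per-stage, process-aware balancing; the delays here are invented):

```python
def balance_skew(path_delays_ns):
    """Pad every input-to-output path up to the slowest one, so successive
    data waves cannot overtake each other inside the un-clocked logic cloud."""
    target = max(path_delays_ns.values())
    return {path: round(target - d, 3) for path, d in path_delays_ns.items()}

# Hypothetical extracted path delays (ns) -> required inserted delay (ns)
print(balance_skew({"a->y": 1.20, "b->y": 0.70, "c->y": 1.05}))
# {'a->y': 0.0, 'b->y': 0.5, 'c->y': 0.15}
```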

IP4-7 DEEPNVM: A FRAMEWORK FOR MODELING AND ANALYSIS OF NON-VOLATILE MEMORY TECHNOLOGIES FOR DEEP LEARNING APPLICATIONS
Speaker:
Ahmet Inci, Carnegie Mellon University, US
Authors:
Ahmet Inci, Mehmet M Isgenc and Diana Marculescu, Carnegie Mellon University, US
Abstract
Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While previous work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM, a framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. We present both iso-capacity and iso-area performance and energy analysis for systems whose last-level caches rely on conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 4.2x and 5x energy-delay product (EDP) reduction and 2.4x and 3x area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide 2.3x EDP reduction on average across all workloads when compared to SRAM. Our comprehensive cross-layer framework is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPU platforms for deep learning applications.
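
For readers unfamiliar with the metric, a worked example of the energy-delay product; the per-access numbers are hypothetical, chosen only to illustrate the arithmetic:

```python
def edp(energy_nj, delay_ns):
    """Energy-delay product, the figure of merit used in the analysis."""
    return energy_nj * delay_ns

# Hypothetical per-access numbers for two last-level cache technologies:
sram = edp(energy_nj=10.0, delay_ns=4.0)
sot  = edp(energy_nj=4.0,  delay_ns=2.0)
print(f"EDP reduction: {sram / sot:.1f}x")  # 5.0x
```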

IP4-8 EFFICIENT EMBEDDED MACHINE LEARNING APPLICATIONS USING ECHO STATE NETWORKS
Speaker:
Rolando Brondolin, Politecnico di Milano, IT
Authors:
Luca Cerina1, Giuseppe Franco2, Claudio Gallicchio3, Alessio Micheli3 and Marco D. Santambrogio4
1Politecnico di Milano, IT; 2Scuola Superiore Sant'Anna / Università di Pisa, IT; 3Università di Pisa, IT; 4Politecnico di Milano, IT
Abstract
The increasing role of Artificial Intelligence (AI) and Machine Learning (ML) in our lives has brought a paradigm shift in how and where computation is performed. Stringent latency requirements and congested bandwidth have moved AI inference from the cloud towards end devices. This change required a major simplification of Deep Neural Networks (DNNs), with memory-efficient libraries or co-processors that perform fast inference with minimal power. Unfortunately, many applications such as natural language processing, time-series analysis, and audio interpretation are built on a different type of Artificial Neural Network (ANN), the so-called Recurrent Neural Networks (RNNs), which, due to their intrinsic architecture, remain too complex and heavy to run efficiently on embedded devices. To solve this issue, the Reservoir Computing paradigm proposes sparse untrained non-linear networks, the Reservoir, that can embed temporal relations without some of the hindrances of RNN training, and with lower memory usage. Echo State Networks (ESNs) and Liquid State Machines are the most notable examples. In this scenario, we propose a performance comparison of an ESN, designed and trained using Bayesian optimization techniques, against current RNN solutions. We aim to demonstrate that ESNs have comparable accuracy, require minimal training time, and are better optimized in terms of memory usage and computational efficiency. Preliminary results show that ESNs are competitive with RNNs on a simple benchmark, and both training and inference are faster, with maximum speed-ups of 2.35x and 6.60x, respectively.
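
A minimal ESN sketch showing why training is cheap: the reservoir is fixed and random, and only a linear readout is fitted (hyper-parameters are illustrative, not the Bayesian-optimized ones from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 1, 100

# Fixed, untrained reservoir: only the spectral radius is controlled, so the
# network has the "echo state" property (fading memory of past inputs).
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

def reservoir_states(inputs):
    x, states = np.zeros(n_res), []
    for u in inputs:
        x = np.tanh(W_in @ np.atleast_1d(u) + W @ x)  # fixed dynamics
        states.append(x.copy())
    return np.asarray(states)

# Training touches only the linear readout (ridge regression), which is why
# ESN training is so much faster than back-propagation through time.
def train_readout(states, targets, ridge=1e-6):
    return np.linalg.solve(states.T @ states + ridge * np.eye(n_res),
                           states.T @ targets)

u = np.sin(np.linspace(0, 8 * np.pi, 400))
S = reservoir_states(u[:-1])
w_out = train_readout(S, u[1:])          # one-step-ahead prediction
print(float(np.mean((S @ w_out - u[1:]) ** 2)))
```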

IP4-9 EXPLFRAME: EXPLOITING PAGE FRAME CACHE FOR FAULT ANALYSIS OF BLOCK CIPHERS
Speaker:
Anirban Chakraborty, IIT Kharagpur, IN
Authors:
Anirban Chakraborty1, Sarani Bhattacharya2, Sayandeep Saha1 and Debdeep Mukhopadhyay1
1IIT Kharagpur, IN; 2KU Leuven, BE
Abstract
The Page Frame Cache (PFC) is a purely software cache, present in modern Linux-based operating systems (OS), which stores the page frames that were recently released by the processes running on a particular CPU. In this paper, we show that the page frame cache can be maliciously exploited by an adversary to steer the pages of a victim process to pre-decided, attacker-chosen locations in memory. We practically demonstrate an end-to-end attack, ExplFrame, where an attacker having only user-level privilege is able to force a victim process's memory pages to vulnerable locations in DRAM and deterministically conduct Rowhammer to induce faults. As a case study, we induce single-bit faults in the T-tables of OpenSSL (v1.1.1) AES using our proposed attack. We also propose an improved fault analysis technique which can exploit any Rowhammer-induced bit-flips in the AES T-tables.

IP4-10 XGBIR: AN XGBOOST-BASED IR DROP PREDICTOR FOR POWER DELIVERY NETWORK
Speaker:
An-Yu Su, National Chiao Tung University, TW
Authors:
Chi-Hsien Pao, Yu-Min Lee and An-Yu Su, National Chiao Tung University, TW
Abstract
This work utilizes XGBoost to build a machine-learning-based IR drop predictor, XGBIR, for the power grid. To capture the behavior of the power grid, we extract several of its features and employ its locality property to save extraction time. XGBIR can be effectively applied to large designs, and the average error of the predicted IR drops is less than 6 mV.
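
A hedged sketch of the approach using the `xgboost` package on synthetic data; the feature set here is invented, whereas XGBIR extracts real power-grid features:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1)

# Hypothetical per-node features: local current demand, effective resistance
# to the nearest pad, neighbor current, decap density (illustrative only).
X = rng.random((5000, 4))
y = 0.050 * X[:, 0] + 0.020 * X[:, 1] + rng.normal(0.0, 0.001, 5000)  # IR drop (V)

model = xgb.XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X[:4000], y[:4000])

err_mv = 1000 * np.abs(model.predict(X[4000:]) - y[4000:])
print(f"mean abs error: {err_mv.mean():.2f} mV")
```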

IP4-11 ON PRE-ASSIGNMENT ROUTE PROTOTYPING FOR IRREGULAR BUMPS ON BGA PACKAGES
Speaker:
Hung-Ming Chen, National Chiao Tung University, TW
Authors:
Jyun-Ru Jiang1, Yun-Chih Kuo2, Simon Chen3 and Hung-Ming Chen1
1National Chiao Tung University, TW; 2National Taiwan University, TW; 3MediaTek Inc., TW
Abstract
In modern package design, bumps are often placed irregularly due to macros that vary in size and position. This makes pre-assignment routing more difficult, even with massive design effort. This work presents a two-stage routing method that can be applied to an arbitrary bump placement on 2-layer BGA packages. Our approach combines escape routing with via assignment: the escape routing handles the irregular bumps, and the via assignment improves the wire congestion and total wirelength of global routing. Experimental results based on industrial cases show that our methodology solves the routing efficiently; we achieve an 82% improvement in wire congestion with a 5% wirelength increase compared with conventional regular treatments.

IP4-12 TOWARDS BEST-EFFORT APPROXIMATION: APPLYING NAS TO APPROXIMATE COMPUTING
Speaker:
Weiwei Chen, Chinese Academy of Sciences, CN
Authors:
Weiwei Chen, Ying Wang, Shuang Yang, Cheng Liu and Lei Zhang, Chinese Academy of Sciences, CN
Abstract
The design of a neural network architecture for code approximation involves a large number of hyper-parameters to explore, so it is a non-trivial task to find a neural-based approximate computing solution that meets the demands of application-specified accuracy and Quality of Service (QoS). Prior works do not address the problem of 'optimal' network architecture design in program approximation, which depends on the user-specified constraints, the complexity of the dataset, and the hardware configuration. In this paper, we apply Neural Architecture Search (NAS) to search for and select neural approximate computing solutions, and provide an automatic framework that tries to generate the best-effort approximation result while satisfying the user-specified QoS/accuracy constraints. Compared with previous methods, this work achieves more than 1.43x speedup and 1.74x energy reduction on average when applied to the AxBench benchmarks.
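
A sketch of the constrained search loop, with plain random search standing in for the paper's NAS strategy (all names and the toy evaluator are illustrative):

```python
import random

def constrained_search(train_and_eval, max_error, trials=50, seed=0):
    """Random-search stand-in for NAS: keep the cheapest candidate network
    that still meets the user-specified QoS/accuracy bound."""
    random.seed(seed)
    best = None
    for _ in range(trials):
        arch = {"layers": random.randint(1, 4),
                "width": random.choice([8, 16, 32, 64])}
        error, cost = train_and_eval(arch)  # caller trains and measures
        if error <= max_error and (best is None or cost < best[1]):
            best = (arch, cost)
    return best

# Toy stand-in for training: wider/deeper nets are more accurate but costlier.
toy = lambda a: (1.0 / (a["layers"] * a["width"]), a["layers"] * a["width"])
print(constrained_search(toy, max_error=0.02))
```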

IP4-13 ON THE AUTOMATIC EXPLORATION OF WEIGHT SHARING FOR DEEP NEURAL NETWORK COMPRESSION
Speaker:
Etienne Dupuis, École Centrale de Lyon, FR
Authors:
Etienne Dupuis1, David Novo2, Ian O'Connor1 and Alberto Bosio1
1Lyon Institute of Nanotechnology, FR; 2Université de Montpellier, FR
Abstract
Deep neural networks demonstrate impressive inference results, particularly in computer vision and speech recognition. However, the associated computational workload and storage render their use prohibitive in resource-limited embedded systems. The approximate computing paradigm has been widely explored in both industrial and academic circles; it improves performance and energy efficiency by relaxing the need for fully accurate operations. Consequently, there is a large number of implementation options with very different approximation strategies (such as pruning, quantization, low-rank factorization, and knowledge distillation). To the best of our knowledge, no automated approach exists for exploring, selecting, and generating the best approximate version of a given convolutional neural network (CNN) under given design objectives. The objective of this work in progress is to show that the design space exploration phase can enable significant network compression without noticeable accuracy loss. We demonstrate this via an example based on weight sharing and show that our method can obtain a 4x compression rate without re-training, and without accuracy loss, on an int-16 version of LeNet-5 (a 5-layer, 1,720-kbit CNN).
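
A minimal sketch of weight sharing via clustering, assuming scikit-learn; with 16 shared values, each int-16 weight collapses to a 4-bit codebook index, which is where a 4x figure comes from:

```python
import numpy as np
from sklearn.cluster import KMeans

def share_weights(w, n_clusters=16):
    """Replace every weight by its cluster centroid; with 16 clusters each
    weight becomes a 4-bit codebook index, i.e. 4x smaller than int-16."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(w.reshape(-1, 1))
    shared = km.cluster_centers_[km.labels_].reshape(w.shape)
    return shared, km.cluster_centers_.ravel()

w = np.random.default_rng(0).normal(size=(5, 5, 6, 16))  # toy conv tensor
w_shared, codebook = share_weights(w)
print("codebook size:", codebook.size,
      "max abs error:", float(np.abs(w - w_shared).max()))
```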

IP4-14 ROBUST AND HIGH-PERFORMANCE 12-T INTERLOCKED SRAM FOR IN-MEMORY COMPUTING
Speaker:
Joycee Mekie, IIT Gandhinagar, IN
Authors:
Neelam Surana, Mili Lavania, Abhishek Barma and Joycee Mekie, IIT Gandhinagar, IN
Abstract
In this paper, we analyze the existing SRAM-based In-Memory Computing (IMC) proposals and show through exhaustive simulations that they fail under process variations: 6-T, 8-T, and 10-T SRAM based IMC architectures suffer from compute-disturb failures (stored data flips during IMC), compute failures (false computation results), and half-select failures, respectively. To circumvent these issues, we propose a novel 12-T Dual Port Dual Interlocked-storage Cell (DPDICE) SRAM. The DPDICE SRAM based IMC architecture (DPDICE-IMC) can perform essential Boolean functions successfully in a single cycle and can perform basic arithmetic operations such as addition and multiplication. Its most striking feature is that it can perform IMC on two datasets simultaneously, thus doubling the throughput. Cumulatively, the proposed DPDICE-IMC is 26.7%, 8x, and 28% better than the 6-T, 8-T, and 10-T SRAM based IMC architectures, respectively.

IP4-15 HIGH DENSITY STT-MRAM COMPILER DESIGN, VALIDATION AND CHARACTERIZATION METHODOLOGY IN 28NM FDSOI TECHNOLOGY
Speaker:
Piyush Jain, ARM Embedded Technologies Pvt Ltd., IN
Authors:
Piyush Jain1, Akshay Kumar1, Nicolaas Van Winkelhoff2, Didier Gayraud2, Surya Gupta3, Abdelali El Amraoui2, Giorgio Palma2, Alexandra Gourio2, Laurentz Vachez2, Luc Palau2, Jean-Christophe Buy2 and Cyrille Dray2
1ARM Embedded Technologies Pvt Ltd., IN; 2ARM France, FR; 3ARM Embedded Technologies Pvt Ltd., IN
Abstract
Spin Transfer Torque Magneto-resistive Random-Access Memory (STT-MRAM) is emerging as a promising substitute for flash memories due to the scaling challenges flash faces in process nodes beyond 28nm. STT-MRAM's high endurance, fast speed, and low power make it suitable for a wide variety of applications. An embedded MRAM (eMRAM) compiler is highly desirable to enable SoC designers to use eMRAM instances in their designs in a flexible manner. However, the development of an eMRAM compiler has the added challenges of handling a multi-fold higher density and maintaining analog circuit accuracy, on top of the challenges associated with conventional SRAM memory compilers. In this paper, we present a successful design methodology for a high-density 128Mb eMRAM compiler in a 28nm fully depleted SOI (FDSOI) process. This compiler enables optimized eMRAM instance generation with varying capacity ranges, word widths, and optional features like repair and error correction. The compiler design is achieved by evolving various architecture design, validation, and characterization methods. A hierarchical and modular characterization methodology is presented to enable high-accuracy characterization and industry-standard EDA view generation from the eMRAM compiler.

IP4-16 AN APPROXIMATION-BASED FAULT DETECTION SCHEME FOR IMAGE PROCESSING APPLICATIONS
Speaker:
Antonio Miele, Politecnico di Milano, IT
Authors:
Matteo Biasielli, Luca Cassano and Antonio Miele, Politecnico di Milano, IT
Abstract
Image processing applications expose an intrinsic resilience to faults. In this application field, the classical Duplication with Comparison (DWC) scheme, where output images are discarded as soon as the two replicas' outputs differ in at least one pixel, may be over-conservative. This paper introduces a novel lightweight fault detection scheme for image processing applications: i) it extends the DWC scheme by substituting one of the two exact replicas with a faster approximated one; and ii) it features a neural-network-based checker designed to distinguish between usable and unusable images instead of faulty/fault-free ones. The application of the hardening scheme to a case study has shown an execution time reduction of 27% to 34% w.r.t. DWC, while guaranteeing comparable fault detection capability.
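
A sketch of the hardened stage's control flow (function names and the checker interface are illustrative, not the paper's API):

```python
def hardened_stage(image, exact_fn, approx_fn, checker):
    """Approximation-based variant of DWC: one exact and one approximate
    replica, with a learned checker judging usable vs. unusable outputs."""
    out_exact = exact_fn(image)
    out_approx = approx_fn(image)
    # The checker sees both outputs and tolerates benign discrepancies,
    # instead of flagging any single differing pixel as a fault.
    if checker(out_exact, out_approx):   # True -> output still usable
        return out_exact
    raise RuntimeError("unusable output: fault suspected, recompute or discard")
```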

IP4-17 TRANSPORT-FREE MODULE BINDING FOR SAMPLE PREPARATION USING MICROFLUIDIC FULLY PROGRAMMABLE VALVE ARRAYS
Speaker:
Gautam Choudhary, Adobe Research, India, IN
Authors:
Gautam Choudhary1, Sandeep Pal1, Debraj Kundu1, Sukanta Bhattacharjee2, Shigeru Yamashita3, Bing Li4, Ulf Schlichtmann4 and Sudip Roy1
1IIT Roorkee, IN; 2Indian Statistical Institute, IN; 3Ritsumeikan University, JP; 4TU Munich, DE
Abstract
Microfluidic fully programmable valve array (FPVA) biochips have emerged as general-purpose flow-based microfluidic lab-on-chips (LoCs). Unlike application-specific flow-based LoCs, an FPVA supports highly re-configurable on-chip components (modules) in a two-dimensional grid-like structure controlled by software programs. Fluids can be loaded into or washed from a cell with flows from an inlet to an outlet of the FPVA, whereas precise cell-to-cell transportation of discrete fluid segments is not possible. The simplest mixing module to realize on an FPVA-based LoC is a four-way mixer consisting of a 2x2 array of cells working as a ring-like mixer with four valves. In this paper, we propose a design automation method for sample preparation that finds suitable placements for the mixing operations of a mixing tree using four-way mixers, without requiring any transportation of fluids between modules. We also propose a heuristic that modifies the mixing tree to reduce the sample preparation time. We have performed extensive simulations and examined several parameters to determine the performance of the proposed solution.
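
The arithmetic behind a mixing tree of four-way mixers is easy to model: each mixer outputs the average of its four input concentrations. A small sketch:

```python
def mix(node):
    """Concentration produced by a tree of four-way mixers: each mixer
    averages the four fluids loaded into its 2x2 ring of cells."""
    if isinstance(node, list):
        assert len(node) == 4, "a four-way mixer takes exactly four inputs"
        return sum(mix(child) for child in node) / 4.0
    return node  # leaf: raw sample (1.0) or buffer (0.0) concentration

# Target concentration 5/16: one inner mixing step feeds one outer step.
tree = [1.0, 0.0, 0.0, [1.0, 0.0, 0.0, 0.0]]
print(mix(tree))  # 0.3125
```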
