IP1 Interactive Presentations


Date: Tuesday 20 March 2018
Time: 16:00 - 16:30
Location / Room: Conference Level, Foyer

Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in the corresponding regular session.

Label / Presentation Title / Authors
IP1-1 RECOM: AN EFFICIENT RESISTIVE ACCELERATOR FOR COMPRESSED DEEP NEURAL NETWORKS
Speaker:
Houxiang Ji, Shanghai Jiao Tong University, CN
Authors:
Houxiang Ji1, Linghao Song2, Li Jiang1, Hai (Helen) Li3 and Yiran Chen2
1Shanghai Jiao Tong University, CN; 2Duke University, US; 3Duke University/TUM-IAS, US
Abstract
Deep Neural Networks (DNNs) play a key role in prevailing machine learning applications. Resistive random-access memory (ReRAM) is capable of both computation and storage, making it well suited to accelerating DNN processing in memory. Moreover, DNNs contain a significant number of zero weights, which offers an opportunity to reduce computation cost by skipping ineffectual calculations on zero weights. However, the irregular distribution of zero weights in DNNs makes it difficult for resistive accelerators to take advantage of this sparsity, because resistive accelerators rely heavily on regular matrix-vector multiplication in ReRAM. In this work, we propose ReCom, the first resistive accelerator to support sparse DNN processing. ReCom is an efficient resistive accelerator for compressed deep neural networks, in which DNN weights are structurally compressed to eliminate zero parameters and become better suited to computation in ReRAM, while zero DNN activations are also considered. Two techniques, Structurally-compressed Weight Oriented Fetching (SWOF) and In-layer Pipeline for Memory and Computation (IPMC), are proposed to efficiently process the compressed DNNs in ReRAM. In our evaluation, ReCom achieves a 3.37x speedup and 2.41x better energy efficiency compared to a state-of-the-art resistive accelerator.
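To make the compression idea concrete, the sketch below computes a layer's matrix-vector product over a weight matrix stored in compressed sparse row (CSR) form, so only non-zero weights are fetched and multiplied. This is a minimal software analogy, assuming a CSR layout purely for illustration; ReCom's SWOF fetching and IPMC pipeline operate on ReRAM crossbars and are not modeled here.

```c
/* Compressed sparse row (CSR) storage for a weight matrix: only non-zero
 * weights are kept, so multiply-accumulate work scales with the number of
 * non-zeros rather than with the full matrix size.  Layout and names are
 * assumptions for this sketch. */
typedef struct {
    const float *val;      /* non-zero weight values                   */
    const int   *col;      /* column index of each non-zero value      */
    const int   *row_ptr;  /* offset of each row's first entry in val  */
    int          rows;
} csr_matrix_t;

/* y = W * x, touching only the stored (non-zero) weights. */
void csr_matvec(const csr_matrix_t *W, const float *x, float *y)
{
    for (int r = 0; r < W->rows; r++) {
        float acc = 0.0f;
        for (int k = W->row_ptr[r]; k < W->row_ptr[r + 1]; k++)
            acc += W->val[k] * x[W->col[k]];
        y[r] = acc;
    }
}
```

The point of the analogy is that work scales with the number of non-zero weights rather than with the full matrix size, which is the benefit a structural compression scheme aims to preserve for ReRAM's regular matrix-vector hardware.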

IP1-2 SPARSENN: AN ENERGY-EFFICIENT NEURAL NETWORK ACCELERATOR EXPLOITING INPUT AND OUTPUT SPARSITY
Speaker:
Jingyang Zhu, Hong Kong University of Science and Technology, HK
Authors:
Jingyang Zhu, Jingbo Jiang, Xizi Chen and Chi-Ying Tsui, Hong Kong University of Science and Technology, HK
Abstract
Contemporary Deep Neural Networks (DNNs) contain millions of synaptic connections across tens to hundreds of layers. This large computational complexity poses a challenge to hardware design. In this work, we leverage the intrinsic activation sparsity of DNNs to substantially reduce execution cycles and energy consumption. An end-to-end training algorithm is proposed to develop a lightweight (less than 5% overhead) run-time predictor for output activation sparsity on the fly. Furthermore, an energy-efficient hardware architecture, SparseNN, is proposed to exploit both input and output sparsity. SparseNN is a scalable architecture with distributed memories and processing elements connected through a dedicated on-chip network. Compared with state-of-the-art accelerators that exploit only input sparsity, SparseNN achieves a 10%-70% improvement in throughput and a power reduction of around 50%.
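As a rough software illustration of exploiting input sparsity, the sketch below skips all multiply-accumulate work for zero input activations; the function name and layout are assumptions, and SparseNN's run-time output-sparsity predictor, distributed memories, and on-chip network are not modeled.

```c
#include <stddef.h>

/* y = W * x, skipping every column whose input activation is zero.
 * A generic software illustration of input-sparsity exploitation only. */
void sparse_matvec(const float *W, const float *x, float *y,
                   size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        y[r] = 0.0f;

    for (size_t c = 0; c < cols; c++) {
        if (x[c] == 0.0f)          /* ineffectual column: skip all MACs */
            continue;
        for (size_t r = 0; r < rows; r++)
            y[r] += W[r * cols + c] * x[c];
    }
}
```

Output sparsity, predicted at run time in the paper, would additionally allow skipping computations whose results are expected to be zero.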

IP1-3 ACCLIB: ACCELERATORS AS LIBRARIES
Speaker:
Jacob R. Stevens, Purdue University, US
Authors:
Jacob Stevens1, Yue Du2, Vivek Kozhikkottu3 and Anand Raghunathan1
1Purdue University, US; 2IBM, US; 3Intel Corporation, US
Abstract
Accelerator-based computing, which has been a mainstay of System-on-Chips (SoCs), is of growing interest to a wider range of computing systems. However, the significant design effort required to identify a computational target for acceleration, design a hardware accelerator, verify its correctness, integrate it into the system, and rewrite applications to use it is a major bottleneck to the widespread adoption of accelerator-based computing. The classical approach to this problem is based on top-down methodologies such as automatic HW/SW partitioning and high-level synthesis (HLS). While HLS has advanced significantly and is seeing increased adoption, it neither leverages the ability of experienced human designers to craft highly optimized RTL nor the growing body of existing hardware accelerators. In this work, we propose ACCLIB, a design framework that allows software developers to automatically use existing libraries of pre-designed hardware accelerators with no prior knowledge of the accelerators' functions, minimal knowledge of hardware design, and minimal design effort. To accomplish this, ACCLIB uses formal verification techniques to match a target software function with a functionally equivalent accelerator from a library of accelerators. It also generates the required HW/SW interfaces as well as the code necessary to offload the computation to the accelerator. We validate ACCLIB by applying it to accelerate six different applications using a library of hardware accelerators in just over one hour per application, demonstrating that the proposed approach has the potential to lower the barrier to adoption of accelerator-based computing.

IP1-4 HPXA: A HIGHLY PARALLEL XML PARSER
Speaker:
Smruti Sarangi, IIT Delhi, IN
Authors:
Isaar Ahmad, Sanjog Patil and Smruti R. Sarangi, IIT Delhi, IN
Abstract
State-of-the-art XML parsing approaches read an XML file byte by byte and use complex finite state machines to process each byte. In this paper, we propose a new parser, HPXA, which reads and processes 16 bytes at a time. We designed most of the components ab initio to ensure that they can process multiple XML tokens and tags in parallel. We propose two basic elements: a sparse 1D array compactor, and a hardware unit called LTMAdder that makes its decisions by adding the rows of a lower triangular matrix. We demonstrate that we are able to process 16 bytes in parallel with very few pipeline stalls for a suite of widely used XML benchmarks. Moreover, at a 28nm technology node, we can process XML data at 106 Gbps, roughly 6.5X faster than competing prior work.

IP1-5 QOR-AWARE POWER CAPPING FOR APPROXIMATE BIG DATA PROCESSING
Speaker:
Sherief Reda, Brown University, US
Authors:
Seyed Morteza Nabavinejad1, Xin Zhan2, Reza Azimi2, Maziar Goudarzi1 and Sherief Reda2
1Sharif University of Technology, IR; 2Brown University, US
Abstract
To limit the peak power consumption of a cluster, a centralized power capping system typically assigns power caps to the individual servers, which are then enforced by local capping controllers. Consequently, the performance and throughput of the servers are affected and the runtime of jobs is extended. We observe that servers in big data processing clusters often execute big data applications that have different tolerances for approximate results. To mitigate the impact of power capping, we propose a new power-Capping aware resource manager for Approximate Big data processing (CAB) that takes into consideration the minimum Quality-of-Result (QoR) of the jobs. We use industry-standard feedback power capping controllers to enforce a power cap quickly, while simultaneously modifying the resource allocations of the various jobs based on their progress rates, target minimum QoR, and the power cap, such that the impact of capping on runtime is minimized. Based on the applied cap and the progress rates of jobs, CAB dynamically allocates computing resources (i.e., number of cores and memory) to the jobs to mitigate the impact of capping on their finish time. We implement CAB in Hadoop-2.7.3 and evaluate its improvement over other methods on a state-of-the-art 28-core Xeon server. We demonstrate that CAB reduces the impact of power capping on runtime by up to 39.4% while meeting the minimum QoR constraints.
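The sketch below gives one plausible shape for the kind of cap-aware reallocation the abstract describes: the cores permitted under the current cap are shifted toward jobs whose progress lags their minimum-QoR target the most. It is an assumed, simplified policy; the names, the treatment of QoR as a fraction of input still to be processed, and the omission of memory allocation and the Hadoop integration are all illustrative assumptions, not CAB's actual algorithm.

```c
#include <stddef.h>

/* Illustrative job state; field names are assumptions for this sketch. */
typedef struct {
    double progress;   /* fraction of input processed so far             */
    double min_qor;    /* minimum QoR, treated here as the fraction of   */
                       /* input that must be processed (an assumption)   */
    int    cores;      /* cores assigned by the allocator                */
} job_t;

/* Distribute the `budget` cores permitted under the current power cap in
 * proportion to each job's remaining work toward its minimum QoR. */
void allocate_cores(job_t *jobs, size_t n, int budget)
{
    double total_deficit = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = jobs[i].min_qor - jobs[i].progress;
        if (d > 0.0)
            total_deficit += d;
    }
    for (size_t i = 0; i < n; i++) {
        double d = jobs[i].min_qor - jobs[i].progress;
        if (d <= 0.0 || total_deficit <= 0.0)
            jobs[i].cores = 0;                     /* target already met */
        else
            jobs[i].cores = (int)(budget * (d / total_deficit));
    }
}
```

Integer truncation can leave a few cores unassigned; a real allocator would redistribute the remainder and also adjust memory, as CAB does.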

IP1-6 EXACT MULTI-OBJECTIVE DESIGN SPACE EXPLORATION USING ASPMT
Speaker:
Kai Neubauer, University of Rostock, DE
Authors:
Kai Neubauer1, Philipp Wanko2, Torsten Schaub2 and Christian Haubelt1
1University of Rostock, DE; 2University of Potsdam, DE
Abstract
An efficient Design Space Exploration (DSE) is imperative for the design of modern, highly complex embedded systems in order to steer the development towards optimal design points. The early evaluation of design decisions at the system level of abstraction helps to find promising regions for subsequent development steps at lower abstraction levels by reducing the complexity of the search problem. In recent work, symbolic techniques, especially Answer Set Programming (ASP) modulo Theories (ASPmT), have been shown to find feasible solutions to highly complex system-level synthesis problems with non-linear constraints very efficiently. In this paper, we present a novel approach to holistic system-level DSE based on ASPmT. To this end, we include additional background theories that concurrently guarantee compliance with hard constraints and perform simultaneous optimization of several design objectives. We implement and compare our approach with a state-of-the-art preference handling framework for ASP. Experimental results indicate that our proposed method produces better solutions with respect to both diversity and convergence to the true Pareto front.

IP1-7 HIPE: HMC INSTRUCTION PREDICATION EXTENSION APPLIED ON DATABASE PROCESSING
Speaker:
Diego Tomé, Centrum Wiskunde & Informatica (CWI), BR
Authors:
Diego Gomes Tomé1, Paulo Cesar Santos2, Luigi Carro2, Eduardo Cunha de Almeida3 and Marco Antonio Zanata Alves3
1Federal University of Paraná, BR; 2UFRGS, BR; 3UFPR, BR
Abstract
The recent Hybrid Memory Cube (HMC) is a smart memory that includes functional units inside one logic layer of the 3D-stacked memory design. In order to execute instructions inside the HMC, the processor needs to send instructions to be executed near the data, keeping most of the pipeline complexity inside the processor. Thus, control-flow and data-flow dependencies are all managed inside the processor, in such a way that only update instructions are supported by the HMC. In order to resolve data-flow dependencies inside the memory, previous work proposed HMC Instruction Vector Extensions (HIVE), which embeds a large number of functional units with an interlock register bank. In this work, we propose the HMC Instruction Predication Extension (HIPE), which supports predicated execution inside the memory in order to transform control-flow dependencies into data-flow dependencies. Our mechanism focuses on removing the high-latency interaction between the processor and the smart memory during the execution of branches that depend on data processed inside the memory. In this paper, we evaluate a balanced design of HIVE compared to x86 and HMC executions. We then present results for the HIPE mechanism when executing a database workload, which is a strong candidate for smart memories. We show interesting performance trade-offs when comparing our mechanism to previous work.
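The transformation that predicated execution enables is classic if-conversion: a branch whose outcome depends on in-memory data is replaced by computing a predicate as a data value and selecting among candidate results, so no control-flow round trip between processor and memory is needed. The C sketch below illustrates the general idea only; it is not HIPE's instruction set extension.

```c
/* Branch form: the outcome of the comparison steers control flow, so a
 * processor-side branch would wait on data held in the smart memory. */
long filter_branch(long value, long threshold, long hit, long miss)
{
    if (value > threshold)
        return hit;
    return miss;
}

/* Predicated (if-converted) form: the comparison result becomes a data
 * value and both candidate results feed a selection, so the whole
 * computation can complete where the data lives. */
long filter_predicated(long value, long threshold, long hit, long miss)
{
    long p = (value > threshold);        /* predicate as data (0 or 1) */
    return p * hit + (1 - p) * miss;     /* select without branching   */
}
```

In a smart memory, the predicated form lets both the comparison and the selection complete near the data, which is the control-to-data-flow conversion the abstract refers to.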

IP1-8 PARAMETRIC FAILURE MODELING AND YIELD ANALYSIS FOR STT-MRAM
Speaker:
Sarath Mohanachandran Nair, Karlsruhe Institute of Technology, DE
Authors:
Sarath Mohanachandran Nair, Rajendra Bishnoi and Mehdi Tahoori, Karlsruhe Institute of Technology, DE
Abstract
The emerging Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) is a promising candidate to replace conventional on-chip memory technologies due to its advantages such as non-volatility, high density, scalability, and unlimited endurance. However, as the technology scales, yield loss due to extreme parametric variations is becoming a major challenge for STT-MRAM because of its higher sensitivity to process variations compared to CMOS memories. In addition, the parametric variations in STT-MRAM exacerbate its stochastic switching behavior, leading to both test-time failures and reliability failures in the field. Since an STT-MRAM array consists of both CMOS and magnetic components, it is important to consider variations in both to capture failures at the system level. In this work, we model the parametric failures of STT-MRAM at the system level, considering the correlation among bit-cells as well as the impact of peripheral components. The proposed approach provides realistic fault distribution maps and equips the designer to investigate the efficacy of different combinations of defect tolerance techniques for an effective design-for-yield exploration.

IP1-9 ONE-WAY SHARED MEMORY
Speaker and Author:
Martin Schoeberl, Technical University of Denmark, DK
Abstract
Standard multicore processors use the shared main memory, via the on-chip caches, for communication between cores. However, this form of communication has two limitations: (1) it is hardly time-predictable and therefore not a good solution for real-time systems, and (2) the single shared memory is a bottleneck in the system. This paper presents a communication architecture for time-predictable multicore systems in which core-local memories are distributed on the chip. A network-on-chip constantly copies data from a sender's core-local memory to a receiver's core-local memory. As this copying is performed in one direction, we call this architecture a one-way shared memory. With the use of time-division multiplexing for the memory accesses and the network-on-chip routers, we achieve a time-predictable solution where the communication latency and bandwidth can be bounded. An example architecture for a 3x3-core processor with 32-bit wide links and memory ports provides a cumulative bandwidth of 29 bytes per clock cycle. Furthermore, the evaluation shows that this architecture, due to its simplicity, is small compared to other network-on-chip solutions.
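A minimal software sketch of the copying behavior described above: in every TDM round the network copies each sender's transmit buffer into the corresponding receiver's receive buffer, in a fixed order and in one direction only. The all-to-all round-robin schedule, buffer sizes, and naming are assumptions for illustration; the paper's actual TDM tables, router design, and the 29 bytes-per-cycle figure are not reproduced here.

```c
#include <stdint.h>

#define CORES 9   /* 3x3 example from the abstract                     */
#define WORDS 4   /* words copied per channel per TDM round (assumed)  */

/* Core-local memories: each core owns transmit buffers read by the NoC
 * and receive buffers written by the NoC.  tx[src][dst] holds the data
 * core `src` wants to send to core `dst`; rx[dst][src] is where it lands. */
static uint32_t tx[CORES][CORES][WORDS];
static uint32_t rx[CORES][CORES][WORDS];

/* One TDM round: the network copies every sender word to its receiver in
 * a fixed, repeating order and in one direction only. */
void tdm_round(void)
{
    for (int src = 0; src < CORES; src++)
        for (int dst = 0; dst < CORES; dst++)
            if (src != dst)
                for (int w = 0; w < WORDS; w++)
                    rx[dst][src][w] = tx[src][dst][w];
}
```

Because the schedule repeats with a fixed period, worst-case communication latency and bandwidth follow directly from the schedule length, which is what makes the scheme time-predictable.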

IP1-10 AN EFFICIENT RESOURCE-OPTIMIZED LEARNING PREFETCHER FOR SOLID STATE DRIVES
Speaker:
Rui Xu, University of Science and Technology of China, CN
Authors:
Rui Xu, Xi Jin, Linfeng Tao, Shuaizhi Guo, Zikun Xiang and Teng Tian, Strongly-Coupled Quantum Matter Physics, Chinese Academy of Sciences, School of Physical Sciences, University of Science and Technology of China, Hefei, Anhui, China, CN
Abstract
In recent years, solid-state drives (SSDs) have been widely deployed in modern storage systems. To increase the performance of SSDs, prefetchers have been designed both at the operating system (OS) layer and at the flash translation layer (FTL). Prefetchers in the FTL have advantages such as OS independence, ease of use, and compatibility. However, due to limited computing capability and memory resources, existing FTL prefetchers merely employ simple sequential prefetching, which may incur a high penalty cost for I/O access streams with complex patterns. In this paper, an efficient learning prefetcher implemented in the FTL is proposed. Considering the resource limitations of SSDs, a learning algorithm based on Markov chains is employed and optimized so that a high hit ratio and low penalty cost can be achieved even for complex access patterns. To validate our design, a simulator with the prefetcher is designed and implemented based on FlashSim. The TPC-H benchmark and an application launch trace are tested on the simulator. According to the experimental results for the TPC-H benchmark, more than 90% of the memory cost can be saved in comparison with a previous OS-layer design. The hit ratio can be increased by 24.1% and the number of mispredicted prefetches reduced by 95.8% in comparison with the simple sequential prefetching strategy.
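As a generic illustration of a Markov-chain prefetcher, the sketch below keeps, for each recently seen logical block, a small table of observed successor blocks and their frequencies, and prefetches the most frequent successor on the next access. Table size, fan-out, and the replacement rule are assumptions for illustration; the paper's FTL implementation and its resource optimizations are not reproduced here.

```c
#include <stdint.h>

#define TABLE  1024   /* tracked logical blocks (assumed)          */
#define FANOUT 2      /* successors remembered per block (assumed) */

/* Per-block history: successors seen after this block, with counts. */
typedef struct {
    uint32_t succ[FANOUT];
    uint32_t count[FANOUT];
} markov_entry_t;

static markov_entry_t table[TABLE];
static uint32_t last_block = UINT32_MAX;

/* Record the observed transition last_block -> block. */
void markov_update(uint32_t block)
{
    if (last_block != UINT32_MAX) {
        markov_entry_t *e = &table[last_block % TABLE];
        int victim = 0;
        for (int i = 0; i < FANOUT; i++) {
            if (e->succ[i] == block) {        /* known successor: count it */
                e->count[i]++;
                last_block = block;
                return;
            }
            if (e->count[i] < e->count[victim])
                victim = i;
        }
        e->succ[victim] = block;          /* replace least-seen successor */
        e->count[victim] = 1;
    }
    last_block = block;
}

/* Block to prefetch after an access to `block`, or UINT32_MAX if unknown. */
uint32_t markov_predict(uint32_t block)
{
    const markov_entry_t *e = &table[block % TABLE];
    int best = 0;
    for (int i = 1; i < FANOUT; i++)
        if (e->count[i] > e->count[best])
            best = i;
    return e->count[best] ? e->succ[best] : UINT32_MAX;
}
```

A sequential prefetcher is the special case where the predicted successor is always block+1; the learned table is what lets a prefetcher follow non-sequential but repeating access patterns.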

IP1-11 BRIDGING DISCRETE AND CONTINUOUS TIME MODELS WITH ATOMS
Speaker:
George Ungureanu, KTH Royal Institute of Technology, SE
Authors:
George Ungureanu1, José E. G. de Medeiros2 and Ingo Sander1
1KTH Royal Institute of Technology, SE; 2University of Brasília, BR
Abstract
Recent trends in replacing traditionally digital components with analog counterparts in order to overcome physical limitations have led to an increasing need for rigorous modeling and simulation of hybrid systems. Combining the two domains under the same set of semantics is not straightforward and often leads to chaotic and non-deterministic behavior due to the lack of a common understanding of aspects concerning time. We propose an algebra of primitive interactions between continuous and discrete aspects of systems which enables their description within two orthogonal layers of computation. We show its benefits from the perspective of modeling and simulation, through the example of an RC oscillator modeled in a formal framework implementing this algebra.
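The RC oscillator used as the paper's example is itself a compact hybrid system: a continuously evolving capacitor voltage coupled to a discrete comparator state. The sketch below simulates a generic relaxation oscillator with forward-Euler integration and a Schmitt-trigger-style threshold, purely to illustrate the discrete/continuous interaction; it does not use the paper's algebra, layered model, or component values (all constants are assumed).

```c
#include <stdio.h>

int main(void)
{
    /* Component values and thresholds are assumed for illustration. */
    const double R = 10e3, C = 100e-9;        /* 10 kOhm, 100 nF        */
    const double vdd = 3.3, v_hi = 2.2, v_lo = 1.1;
    const double dt = 1e-6;                   /* integration time step  */

    double v = 0.0;   /* continuous state: capacitor voltage            */
    int    q = 1;     /* discrete state: comparator / driver output     */

    for (int step = 0; step < 20000; step++) {
        double target = q ? vdd : 0.0;        /* charge or discharge    */
        v += dt * (target - v) / (R * C);     /* continuous evolution   */
        if (q && v >= v_hi)                   /* discrete transitions   */
            q = 0;                            /* at the thresholds      */
        else if (!q && v <= v_lo)
            q = 1;
        if (step % 2000 == 0)
            printf("t = %6.4f s  v = %5.3f V  q = %d\n", step * dt, v, q);
    }
    return 0;
}
```

The two kinds of state advance under different rules: the voltage by integration over a time step, and the output by instantaneous transitions at thresholds, which is precisely the interaction that needs a common treatment of time.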

IP1-12 OHEX: OS-AWARE HYBRIDIZATION TECHNIQUES FOR ACCELERATING MPSOC FULL-SYSTEM SIMULATION
Speaker:
Róbert Lajos Bücs, Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, DE
Authors:
Róbert Lajos Bücs1, Maximilian Fricke2, Rainer Leupers1, Gerd Ascheid1, Stephan Tobies2 and Andreas Hoffmann2
1Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, DE; 2Synopsys GmbH, DE
Abstract
Virtual platform (VP) technology is an established enabler of embedded system design. However, the sheer number of CPU models in modern multi-core VPs forms a performance bottleneck. Hybrid simulation addresses this issue by executing parts of the embedded software stack on the host. Although the approach is significantly faster, hybridization cannot cope with higher software layers, e.g., Operating Systems (OSs). Thus, this paper presents the OS-aware Host EXtension (OHEX) framework to accelerate VPs while expanding the applicability of hybridization. OHEX is evaluated on various system layers, yielding speedups between 2.99x and 21.14x on specific benchmarks.

IP1-13 A HIGHLY EFFICIENT FULL-SYSTEM VIRTUAL PROTOTYPE BASED ON VIRTUALIZATION-ASSISTED APPROACH
Speaker:
Hsin-I Wu, National Tsing Hua University, Department of Computer Science, Hsinchu, Taiwan, TW
Authors:
Hsin-I Wu, Chi-Kang Chen, Tsung-Ying Lu and Ren-Song Tsay, National Tsing Hua University, TW
Abstract
An effective full-system virtual prototype is critical for early-stage system design exploration. Generally, however, traditional approaches to accelerating virtual prototypes cannot accurately analyze system performance or model non-deterministic inter-component interactions, due to the unpredictability of simulation progress. In this paper, we propose an effective virtualization-assisted approach for modeling and performance analysis. First, we develop a deterministic synchronization process that manages the interactions affecting data dependencies in chronological order, so that inter-component interactions are modeled consistently. Next, we create accurate timing and bus contention models based on runtime operation statistics for analyzing system performance. We implement the proposed virtualization-assisted approach on an off-the-shelf System-on-Chip (SoC) board to demonstrate the effectiveness of our idea. The experimental results show that the proposed approach runs 12~77 times faster than a commercial virtual prototyping tool, with performance estimates only 3~6% apart from real systems.

IP1-14 INDUSTRIAL EVALUATION OF TRANSITION FAULT TESTING FOR COST EFFECTIVE OFFLINE ADAPTIVE VOLTAGE SCALING
Speaker:
Mahroo Zandrahimi, TU Delft, NL
Authors:
Mahroo Zandrahimi1, Philippe Debaud2, Armand Castillejo2 and Zaid Al-Ars1
1Delft University of Technology, NL; 2STMicroelectronics, FR
Abstract
Adaptive voltage scaling (AVS) has been widely used to compensate for process, voltage, and temperature variations, as well as for power optimization of integrated circuits. Current industrial state-of-the-art AVS approaches using Process Monitoring Boxes (PMBs) have shown several limitations, such as a huge characterization effort, which makes these approaches very expensive, and low accuracy, which results in extra margins and consequently leads to yield loss and performance limitations. To overcome these limitations, in this paper we propose an alternative solution using transition fault test patterns, which eliminates the need for PMBs while improving the accuracy of voltage estimation. The paper shows, using simulation of ISCAS'99 benchmarks with a 28nm FD-SOI library, that AVS using transition fault testing (TF-based AVS) results in an error as low as 5.33%. The paper also shows that the PMB approach can only account for 85% of the uncertainty in voltage measurements, which results in power waste, while the TF-based approach can account for 99% of that uncertainty.

IP1-15 AN ANALYSIS ON RETENTION ERROR BEHAVIOR AND POWER CONSUMPTION OF RECENT DDR4 DRAMS
Speaker:
Deepak M. Mathew, University of Kaiserslautern, DE
Authors:
Deepak M. Mathew1, Martin Schultheis1, Carl C. Rheinländer1, Chirag Sudarshan1, Matthias Jung2, Christian Weis1 and Norbert Wehn1
1University of Kaiserslautern, DE; 2Fraunhofer IESE, DE
Abstract
DRAM technology is scaling aggressively, which results in high leakage power, worse data retention behavior, and large process variations. Due to these process variations, vendors provide large guard bands on various DRAM currents and timing specifications that are overly pessimistic. Detailed knowledge of the average-case DRAM retention behavior and currents allows memory system performance and energy efficiency to be improved for specific applications by moving away from worst-case behavior. In this paper, we present an advanced measurement platform to investigate the retention behavior of off-the-shelf DDR4 DRAMs and to precisely measure various DRAM currents (IDDs and IPPs) over a wide range of operating temperatures. Error Checking and Correction (ECC) schemes are popular for correcting randomly scattered single-bit errors. Since retention failures also occur randomly, ECC can be used to improve DRAM retention behavior. Therefore, for the first time, we show the influence of ECC on the retention behavior of recent DDR4 DRAMs, and how it varies across DRAM architectures, considering the detailed structure of the DRAM (true-cell versus mixed-cell devices).

IP1-16 A BOOLEAN MODEL FOR DELAY FAULT TESTING OF EMERGING DIGITAL TECHNOLOGIES BASED ON AMBIPOLAR DEVICES
Speaker:
Davide Bertozzi, DE - University of Ferrara, IT
Authors:
Marcello Dalpasso1, Davide Bertozzi2 and Michele Favalli2
1DEI - Univ. of Padova, IT; 2DE - Univ. of Ferrara, IT
Abstract
Emerging nanotechnologies such as ambipolar carbon nanotube field-effect transistors (CNTFETs) and silicon nanowire FETs (SiNFETs) provide ambipolar devices that allow the design of more complex logic primitives than those found in today's typical CMOS libraries. When switching, such devices show behavior not seen in simpler CMOS and FinFET cells, making existing delay fault testing approaches unsuitable. We provide a Boolean model of switching ambipolar devices to support delay fault testing of logic cells based on such devices in both Boolean and pseudo-Boolean satisfiability engines.

IP1-17 ATPG POWER GUARDS: ON LIMITING THE TEST POWER BELOW THRESHOLD
Speaker:
Virendra Singh, Indian Institute of Technology Bombay, IN
Authors:
Rohini Gulve1 and Virendra Singh2
1Indian Institute of Technology Bombay, IN; 2IIT Bombay, IN
Abstract
Modern circuits with high performance and low power requirements impose strict constraints on manufacturing test generation, particularly on timing tests. Delay tests are used for performance grading of the circuit. During the application of the test, power consumption has to stay below the functional threshold value in order to avoid yield loss. This work proposes a new direction for generating power-safe tests without any changes to the DFT (design for testability) structure or existing CAD (computer-aided design) tools. We propose a virtual wrapper circuitry around the circuit under test (CUT), used for test generation purposes, which acts as a shield to obtain power-safe vectors. The wrapper prohibits the generation of a test vector if its power consumption exceeds the threshold limit. We consider analytical power models for the power analysis of candidate test vector patterns. Experiments performed on benchmark circuits show power-safe test generation without coverage loss.
