IP3 Interactive Presentations


Date: Wednesday 29 March 2017
Time: 16:00 - 16:30
Location / Room: IP sessions (in front of rooms 4A and 5A)

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated with each IP paper is on display throughout the afternoon. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon's Interactive Presentations session, the 'Best IP of the Day' award is given.

Label | Presentation Title | Authors
IP3-1: LEVERAGING AGING EFFECT TO IMPROVE SRAM-BASED TRUE RANDOM NUMBER GENERATORS
Speaker:
Mohammad Saber Golanbari, Karlsruhe Institute of Technology (KIT), DE
Authors:
Saman Kiamehr1, Mohammad Saber Golanbari2 and Mehdi Tahoori2
1Karlsruhe Institute of Technology (KIT), DE; 2Karlsruhe Institute of Technology, DE
Abstract
The start-up values of SRAM cells can be used as a random number vector or as a seed for the generation of a pseudo-random number. However, the randomness of the generated number is rather low, since many of the cells are strongly skewed due to process variation and their start-up values lean toward zero or one. In this paper, we propose an approach to increase the randomness of SRAM-based True Random Number Generators (TRNGs) by leveraging the impact of transistor aging. The idea is to iteratively power up the SRAM cells and put them under accelerated aging to make the cells less skewed and hence obtain a more random vector. The simulation results show that the min-entropy of the SRAM-based TRNG increases by 10X using this approach.
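
As a rough illustration of why balancing skewed cells helps (a toy model, not the authors' simulation flow: the aging strength, cell count, and skew distribution below are all assumed), the sketch computes the standard per-cell min-entropy H_min = -log2(max(p, 1-p)) before and after pulling power-up probabilities toward 0.5:

```python
import math
import random

random.seed(1)

def min_entropy(p_ones):
    """Total min-entropy (bits) of independent cells with given '1'-probabilities."""
    return sum(-math.log2(max(p, 1.0 - p)) for p in p_ones)

def age_toward_balance(p, strength=0.5):
    # Toy aging step: stressing the transistors that win the power-up race
    # weakens them, pulling the power-up probability back toward 0.5.
    return p + strength * (0.5 - p)

# Hypothetical cell population: most cells are strongly skewed by process variation.
cells = [min(0.999, max(0.001, random.gauss(0.5, 0.35))) for _ in range(1024)]

print("min-entropy before aging: %.1f bits" % min_entropy(cells))
for _ in range(3):                       # iterative power-up + accelerated aging
    cells = [age_toward_balance(p) for p in cells]
print("min-entropy after aging:  %.1f bits" % min_entropy(cells))
```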

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-2: DESIGN AUTOMATION FOR OBFUSCATED CIRCUITS WITH MULTIPLE VIABLE FUNCTIONS
Speaker:
Shahrzad Keshavarz, University of Massachusetts Amherst, US
Authors:
Shahrzad Keshavarz1, Christof Paar2 and Daniel Holcomb1
1University of Massachusetts Amherst, US; 2Horst Görtz Institut for IT-Security, Ruhr-Universität Bochum, DE
Abstract
Gate camouflaging is a technique for obfuscating the function of a circuit against reverse engineering attacks. However, if an adversary has pre-existing knowledge about the set of functions that are viable for an application, random camouflaging of gates will not obfuscate the function well. In this case, the adversary can target their search, and only needs to decide whether each of the viable functions could be implemented by the circuit. In this work, we propose a method for using camouflaged cells to obfuscate a design that has a known set of viable functions. The circuit produced by this method ensures that an adversary will not be able to rule out any viable functions unless she is able to uncover the gate functions of the camouflaged cells. Our method comprises iterated synthesis within an overall optimization loop to combine the viable functions, followed by technology mapping to deploy camouflaged cells while maintaining the plausibility of all viable functions. We evaluate our technique on cryptographic S-box functions and show that, relative to a baseline approach, it achieves up to 38% area reduction in PRESENT-style S-Boxes and 48% in DES S-boxes.

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-3: DOUBLE MAC: DOUBLING THE PERFORMANCE OF CONVOLUTIONAL NEURAL NETWORKS ON MODERN FPGAS
Speaker:
Jongeun Lee, UNIST, KR
Authors:
Dong Nguyen1, Daewoo Kim1 and Jongeun Lee2
1UNIST, KR; 2Ulsan National Institute of Science and Technology (UNIST), KR
Abstract
This paper presents a novel method to double the computation rate of convolutional neural network (CNN) accelerators by packing two multiply-and-accumulate (MAC) operations into one DSP block of off-the-shelf FPGAs (called Double MAC). While a general SIMD MAC using a single DSP block seems impossible, our solution is tailored to the kind of MAC operations required by a convolution layer. Our preliminary evaluation shows that not only can our Double MAC approach double the computation throughput of a CNN layer with essentially the same resource utilization, but network-level performance can also be improved by 14-84% over a highly optimized state-of-the-art accelerator solution, depending on the CNN hyper-parameters.
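
The packing trick behind such designs can be shown with plain integer arithmetic. The sketch below packs two unsigned 8-bit inputs that share one weight into a single wide multiplication and then splits the product back into two MAC results; it is a generic illustration of operand packing under assumed bit-widths, not necessarily the paper's exact (signed, DSP-block-specific) scheme:

```python
N = 8                       # input bit-width (assumed, unsigned)
M = 8                       # weight bit-width (assumed, unsigned)
SHIFT = N + M               # b*w needs at most N+M bits, so place a past it

def double_mac(a, b, w, acc_a=0, acc_b=0):
    """One wide multiply, two MACs: returns (acc_a + a*w, acc_b + b*w)."""
    packed = (a << SHIFT) + b               # pack both inputs into one operand
    product = packed * w                    # single multiplication
    prod_b = product & ((1 << SHIFT) - 1)   # low field  = b*w (no carry into it)
    prod_a = product >> SHIFT               # high field = a*w
    return acc_a + prod_a, acc_b + prod_b

# Quick check against two independent MACs.
a, b, w = 200, 37, 251
assert double_mac(a, b, w) == (a * w, b * w)
```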

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-4: BITMAN: A TOOL AND API FOR FPGA BITSTREAM MANIPULATIONS
Speaker:
Dirk Koch, University of Manchester, GB
Authors:
Khoa Pham, Edson Horta and Dirk Koch, University of Manchester, GB
Abstract
To fully support the partial reconfiguration capabilities of FPGAs, this paper introduces BitMan, a tool and API for generating and manipulating configuration bitstreams. BitMan supports recent Xilinx FPGAs handled by the vendor's ISE and Vivado tool suites, including the latest Virtex-6, 7 Series, UltraScale, and UltraScale+ devices. The functionality includes high-level commands, such as cutting out regions of a bitstream and placing or relocating modules on an FPGA, as well as low-level commands for modifying primitives and for routing clock networks or rerouting signal connections at run-time. All this is possible without the vendor CAD tools, allowing BitMan to be used even on embedded CPUs. The paper describes the capabilities, API, and performance evaluation of BitMan.

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-5: A GENERIC TOPOLOGY SELECTION METHOD FOR ANALOG CIRCUITS WITH EMBEDDED CIRCUIT SIZING DEMONSTRATED ON THE OTA EXAMPLE
Speaker:
Andreas Gerlach, Robert Bosch Centre for Power Electronics, DE
Authors:
Andreas Gerlach1, Thoralf Rosahl2, Frank-Thomas Eitrich2 and Jürgen Scheible1
1Robert Bosch Centre for Power Electronics, DE; 2Robert Bosch GmbH, DE
Abstract
We present a methodology for the automatic selection and sizing of analog circuits, demonstrated on the OTA circuit class. The methodology consists of two steps: a generic topology selection method supported by a "part-sizing" process, and a subsequent final sizing. The circuit topologies provided by a reuse library are classified in a topology tree. The appropriate topology is selected by traversing the topology tree starting at the root node. The decision at each node is derived from the result of the part-sizing, which is in fact a node-specific set of simulations. The final sizing is a simulation-based optimization. By combining the topology selection with the part-sizing process in the selection loop, we significantly reduce the overall simulation effort compared to a classical simulation-based optimization. The result is an interactive, user-friendly system that eases the analog designer's work significantly compared to typical industrial practice in analog circuit design. The topology selection method with sizing is implemented as a tool within a typical analog design environment.
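
The control flow of such a selection loop might look like the following sketch (purely illustrative: the Node class, the part_sizing scoring hook, and the final_sizing step are invented stand-ins for the node-specific simulations and the simulation-based optimization described above):

```python
class Node:
    """Node of the topology tree; leaves carry concrete reuse-library topologies."""
    def __init__(self, name, children=None, topology=None):
        self.name = name
        self.children = children or []
        self.topology = topology

def select_topology(root, spec, part_sizing, final_sizing):
    """Descend from the root, letting the part-sizing simulations pick a branch,
    then run the simulation-based final sizing on the chosen topology."""
    node = root
    while node.children:
        # part_sizing(child, spec) runs the node-specific set of simulations
        # and returns a figure of merit for how well that branch fits 'spec'.
        node = max(node.children, key=lambda child: part_sizing(child, spec))
    return final_sizing(node.topology, spec)
```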

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-6: LATENCY ANALYSIS OF HOMOGENEOUS SYNCHRONOUS DATAFLOW GRAPHS USING TIMED AUTOMATA
Speaker:
Guus Kuiper, University of Twente, NL
Authors:
Guus Kuiper1 and Marco Bekooij2
1University of Twente, NL; 2University of Twente + NXP semiconductors, NL
Abstract
There are several analysis models and corresponding temporal analysis techniques for checking whether applications executed on multiprocessor systems meet their real-time constraints. However, there currently exists no exact end-to-end latency analysis technique for Homogeneous Synchronous Dataflow (HSDF) with Auto-concurrency (HSDFa) models that takes the correlation between the firing durations of different firings into account. In this paper we present a transformation of strongly connected HSDFa models into timed automata models. This enables an exact end-to-end latency analysis because the correlation between the firing durations of different firings is taken into account. In a case study we compare the latency obtained using timed automata with that of a Linear Program (LP) based analysis technique that relies on a deterministic abstraction, and also compare their run-times. Exact end-to-end latency analysis results are obtained using timed automata, whereas this is not possible using deterministic timed-dataflow models.

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-7: COVERT: COUNTER OVERFLOW REDUCTION FOR EFFICIENT ENCRYPTION OF NON-VOLATILE MEMORIES
Speaker:
Kartik Mohanram, ECE Dept, University of Pittsburgh, US
Authors:
Shivam Swami and Kartik Mohanram, University of Pittsburgh, US
Abstract
Security vulnerabilities arising from data persistence in emerging non-volatile memories (NVMs) necessitate memory encryption to ensure data security. Whereas counter mode encryption (CME) is a stop-gap practical approach to address this concern, it suffers from frequent memory re-encryption (system freeze) for small-sized counters and poor system performance for large-sized counters. CME thus imposes heavy overheads on memory, system performance, and system availability in practice. We propose Counter OVErflow ReducTion (COVERT), a CME-based memory encryption solution that performs on-demand memory allocation to reduce the memory re-encryption frequency caused by fast-growing counters, while retaining the area/performance benefits of small-sized counters. Our full-system simulations of a phase change memory (PCM) architecture across SPEC CPU2006 benchmarks show that, for equivalent overhead and no impact on performance, COVERT simultaneously extends the interval between full-memory re-encryptions from 6 minutes to 25 hours and doubles memory lifetime in comparison to state-of-the-art CME techniques.
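
For context, counter-mode memory encryption derives a per-write pad from the line address and a per-line counter and XORs it with the data, so a counter overflow forces re-encryption under a fresh key. The sketch below shows this generic CME behaviour with assumed parameters (HMAC-SHA256 stands in for the AES-based pad generation and the counter width is deliberately tiny); it is not COVERT's counter organization:

```python
import hashlib
import hmac

LINE_BYTES = 32             # toy line size (real memories typically use 64 B lines)
COUNTER_BITS = 7            # deliberately tiny so the overflow case is easy to hit

def pad(key, addr, counter):
    # Keyed-hash stand-in for the AES-based pad of real counter-mode designs.
    msg = addr.to_bytes(8, "little") + counter.to_bytes(8, "little")
    return hmac.new(key, msg, hashlib.sha256).digest()[:LINE_BYTES]

def xor_line(key, addr, counter, data):
    # Encryption and decryption are the same XOR with the per-(address, counter) pad.
    return bytes(d ^ p for d, p in zip(data, pad(key, addr, counter)))

def write_line(key, addr, counter, plaintext):
    counter += 1
    if counter >= (1 << COUNTER_BITS):
        # Counter overflow: the memory must be re-encrypted under a fresh key
        # before counters can wrap -- the costly event COVERT aims to make rare.
        raise OverflowError("counter overflow -> full-memory re-encryption")
    return counter, xor_line(key, addr, counter, plaintext)

key = b"sixteen byte key"
ctr, ct = write_line(key, 0x1000, 0, b"A" * LINE_BYTES)
assert xor_line(key, 0x1000, ctr, ct) == b"A" * LINE_BYTES
```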

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-8: A WEAR-LEVELING-AWARE COUNTER MODE FOR DATA ENCRYPTION IN NON-VOLATILE MEMORIES
Speaker:
Fangting Huang, Huazhong University of Science and Technology, CN
Authors:
Fangting Huang1, Dan Feng2, Yu Hua2 and Wen Zhou2
1Huazhong University of Science and Technology, CN; 2Wuhan National Lab for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, China, CN
Abstract
Counter-mode encryption has been widely used to protect NVMs from malicious attacks, due to its proven security and high performance. However, this scheme suffers from the counter-size-versus-re-encryption problem: per-line counters must be relatively large to avoid counter overflow, or re-encryption of the entire memory is required to ensure security. To address this problem, we propose a novel wear-leveling-aware counter mode for data encryption, called Resetting Counter via Remapping (RCR). The basic idea behind RCR is to leverage wear-leveling remappings to reset the line counter. With a carefully designed procedure, RCR avoids counter overflow with a much smaller counter size. The salient features of RCR include low counter storage overhead, a high counter cache hit ratio, and no extra re-encryption overhead. Compared with state-of-the-art works, RCR obtains significant performance improvements, e.g., up to a 57% reduction in IPC degradation, under the evaluation of 8 memory-intensive benchmarks from SPEC 2006.
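
A toy rendering of the remapping idea (not RCR's actual procedure, and with all structure invented for illustration): when wear leveling relocates a line, the line is rewritten anyway, so its counter can restart from zero instead of growing toward an overflow that would trigger full-memory re-encryption:

```python
class Line:
    def __init__(self, location):
        self.location = location
        self.counter = 0        # small per-line CME counter

def write(line):
    line.counter += 1           # every write bumps the counter (as in plain CME)

def wear_level_remap(line, new_location):
    # The relocation rewrites (and hence re-encrypts) just this one line,
    # so its counter can safely restart at zero -- no overflow, no freeze.
    line.location = new_location
    line.counter = 0
```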

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-9: TUNNEL FET BASED REFRESH-FREE DRAM (Best Paper Award Candidate)
Speaker:
Navneet Gupta, ISEP-Paris, FR
Authors:
Navneet Gupta1, Adam Makosiej2, Andrei Vladimirescu3, Amara Amara3 and Costin Anghel3
1Institut Supérieur d'Électronique de Paris (ISEP), FR and CEA-Leti (Commissariat à l'Énergie Atomique et aux Énergies Alternatives), FR; 2CEA-Leti (Commissariat à l'Énergie Atomique et aux Énergies Alternatives), FR; 3Institut Supérieur d'Électronique de Paris (ISEP), FR
Abstract
A refresh-free and scalable ultimate DRAM (uDRAM) bitcell and architecture are proposed for embedded applications. The uDRAM 1T1C bitcell is designed using Tunnel FET access devices. The proposed design stores data statically during retention, eliminating the need for refresh. This is achieved by exploiting the negative differential resistance property of TFETs together with the storage-capacitor leakage. uDRAM allows the storage capacitor to be scaled by 87% and 80% in comparison to DDR and eDRAMs, respectively. The implemented design has sub-array read/write access times below 4 ns. A bitcell area of 0.0275 μm² is achieved in 28 nm FDSOI-CMOS and scales further with the technology. The estimated throughput gain from refresh removal is 3.8% to 18% in comparison to CMOS DRAMs.

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-10: A HARDWARE IMPLEMENTATION OF THE MCAS SYNCHRONIZATION PRIMITIVE
Speaker:
Smruti Sarangi, IIT Delhi, IN
Authors:
Srishty Patel, Rajshekar Kalayappan, Ishani Mahajan and Smruti R. Sarangi, IIT Delhi, IN
Abstract
Lock-based parallel programs are easy to write. However, they are inherently slow, as the synchronization is blocking in nature. Non-blocking lock-free programs, which use atomic instructions such as compare-and-set (CAS), are significantly faster. However, lock-free programs are notoriously difficult to design and debug. This can be greatly eased if the primitives work on multiple memory locations instead of one. We propose MCAS, a hardware implementation of a multi-word compare-and-set primitive. Ease of programming aside, MCAS-based programs are 13.8X and 4X faster on average than lock-based and traditional lock-free programs, respectively. The area overhead, in a 32-core 400 mm² chip, is a mere 0.046%.
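
The semantics of a multi-word compare-and-set can be summarized in a few lines; the sketch below emulates the atomicity with a global lock purely for illustration, whereas the paper provides it in hardware:

```python
import threading

_mcas_lock = threading.Lock()   # stands in for the hardware's atomicity guarantee

def mcas(memory, updates):
    """updates: list of (address, expected, new). Atomically: if every address
    holds its expected value, install all the new values and return True;
    otherwise change nothing and return False."""
    with _mcas_lock:
        if all(memory[addr] == expected for addr, expected, _ in updates):
            for addr, _, new in updates:
                memory[addr] = new
            return True
        return False

# Example: atomically move a value from slot 0 to slot 1.
mem = {0: 42, 1: 0}
assert mcas(mem, [(0, 42, 0), (1, 0, 42)]) and mem == {0: 0, 1: 42}
```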

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-11: BANDITS: DYNAMIC TIMING SPECULATION USING MULTI-ARMED BANDIT BASED OPTIMIZATION
Speaker:
Jeff Zhang, New York University, US
Authors:
Jeff Zhang and Siddharth Garg, New York University, US
Abstract
Timing speculation has recently been proposed as a method for increasing performance beyond what is achievable by conventional worst-case design techniques. Starting with the observation of fast temporal variations in timing error probabilities, we propose a run-time technique to dynamically determine the optimal degree of timing speculation (i.e., how aggressively the processor is over-clocked), based on a novel formulation of the dynamic timing speculation problem as a multi-armed bandit problem. Detailed post-synthesis timing simulations of a 5-stage MIPS processor running a variety of workloads show that the proposed adaptive mechanism improves processor performance significantly compared with a competing approach (about 8.3% improvement), while incurring only about 2.8% performance loss on average compared with an oracle.
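
To make the bandit formulation concrete, the sketch below treats each over-clocking level as an arm and learns which one maximizes useful throughput with a simple epsilon-greedy rule; the clock multipliers, error probabilities, recovery penalty, and the epsilon-greedy learner itself are assumptions, not the paper's exact algorithm or reward model:

```python
import random

random.seed(0)

ARMS = [1.00, 1.10, 1.20, 1.30]        # hypothetical clock multipliers (the "arms")
ERROR_PROB = [0.00, 0.01, 0.05, 0.25]  # hypothetical timing-error probabilities
EPSILON = 0.1                          # exploration rate

def reward(arm):
    # Useful throughput: the speedup, discounted when a timing error forces
    # recovery (half the interval's work assumed lost on an error).
    error = random.random() < ERROR_PROB[arm]
    return ARMS[arm] * (0.5 if error else 1.0)

counts = [0] * len(ARMS)
estimates = [0.0] * len(ARMS)

for _ in range(10000):
    if random.random() < EPSILON:                      # explore
        arm = random.randrange(len(ARMS))
    else:                                              # exploit current estimate
        arm = max(range(len(ARMS)), key=lambda a: estimates[a])
    r = reward(arm)
    counts[arm] += 1
    estimates[arm] += (r - estimates[arm]) / counts[arm]   # running mean

print("estimated reward per over-clock level:", [round(v, 3) for v in estimates])
```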

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-12: DESIGN AND IMPLEMENTATION OF A FAIR CREDIT-BASED BANDWIDTH SHARING SCHEME FOR BUSES
Speaker:
Carles Hernandez, Barcelona Supercomputing Center (BSC), ES
Authors:
Mladen Slijepcevic1, Carles Hernandez2, Jaume Abella3 and Francisco Cazorla4
1Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, ES; 2Barcelona Supercomputing Center, ES; 3Barcelona Supercomputing Center (BSC-CNS), ES; 4Barcelona Supercomputing Center and IIIA-CSIC, ES
Abstract
Fair arbitration in the access to hardware shared resources is fundamental to obtaining low worst-case execution time (WCET) estimates in the context of critical real-time systems, for which performance guarantees are essential. Several hardware mechanisms exist for managing arbitration in those resources (buses, memory controllers, etc.). They typically attain fairness in terms of the number of slots each contender (e.g., core) is granted access to the shared resource. However, those policies may lead to unfair bandwidth allocations for workloads in which some contenders issue short requests and others issue long requests. We propose a Credit-Based Arbitration (CBA) mechanism that achieves fairness in the cycles each core is granted access to the resource rather than in the number of granted slots. Furthermore, we implement CBA as part of a LEON3 4-core processor for the space domain on an FPGA, proving the feasibility and good performance characteristics of the design by comparing it against other arbitration schemes.
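
A behavioural sketch of the credit idea (illustrative only; the parameters and the granting model are assumed, not the CBA hardware): each contender earns credit every round, the grant goes to the requester with the most credit, and the winner is charged for the cycles its request occupies rather than for a single slot:

```python
def arbitrate(requests, credits, earn=1):
    """requests: {core: request length in cycles}; credits: per-core credit pool."""
    for core in credits:
        credits[core] += earn                   # every contender earns per round
    if not requests:
        return None
    winner = max(requests, key=lambda core: credits[core])
    credits[winner] -= requests[winner]         # charge the cycles actually used
    return winner

# Example: core A issues short 1-cycle requests, core B long 4-cycle requests.
credits = {"A": 0, "B": 0}
grants = [arbitrate({"A": 1, "B": 4}, credits) for _ in range(10)]
print(grants)   # A is granted ~4x more often, so both get a similar share of cycles
```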

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-13: TECHNOLOGY MAPPING WITH ALL SPIN LOGIC
Speaker:
Azadeh Davoodi, University of Wisconsin - Madison, US
Authors:
Boyu Zhang1 and Azadeh Davoodi2
1University of Wisconsin-Madison, US; 2University of Wisconsin - Madison, US
Abstract
This work is the first to propose a technology mapping algorithm for All Spin Logic (ASL) devices. The ASL device is the most actively pursued among spintronic devices, which themselves fall under emerging post-CMOS nanotechnologies. We identify the shortcomings of directly applying classical technology mapping to ASL devices, and propose techniques that extend the classical procedure to handle these shortcomings. Our results show that our ASL-aware technology mapping algorithm achieves on average 9.15%, and up to 27.27%, improvement in delay (when optimizing delay) with a slight improvement in area, compared to the solution generated by classical technology mapping. In a broader sense, our results show the need for developing circuit-level CAD tools that are aware of and optimized for emerging technologies in order to better assess their promise as we move to the post-CMOS era.

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-14: A NEW METHOD TO IDENTIFY THRESHOLD LOGIC FUNCTIONS
Speaker:
Spyros Tragoudas, Southern Illinois University Carbondale, US
Authors:
Seyed Nima Mozaffari, Spyros Tragoudas and Themistoklis Haniotakis, Southern Illinois University, US
Abstract
An Integer Linear Programming based method to identify current mode threshold logic functions is presented. The approach minimizes the transistor count and benefits from a generalized definition of threshold logic functions. Process variations are taken into consideration. Experimental results show that many more functions can be implemented with predetermined hardware overhead, and the hardware requirement of a large percentage of existing threshold functions is reduced.
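
By definition, a threshold logic function evaluates to 1 exactly when a weighted sum of its inputs reaches a threshold, f(x) = 1 iff sum_i w_i x_i >= T. The brute-force check below (small integer weights only, no transistor-count objective and no variation modelling, unlike the ILP method above) is just a way to see that definition in action:

```python
from itertools import product

def is_threshold(truth_table, n, max_w=3):
    """truth_table maps n-bit input tuples to 0/1. Returns some (weights, T)
    realizing the function as a threshold function, or None if no realization
    exists within the searched integer weight range."""
    for weights in product(range(-max_w, max_w + 1), repeat=n):
        for T in range(-max_w * n, max_w * n + 2):
            if all((sum(w * x for w, x in zip(weights, xs)) >= T) == bool(out)
                   for xs, out in truth_table.items()):
                return weights, T
    return None

# Majority-of-three is a threshold function (e.g. weights (1,1,1), T = 2);
# two-input XOR is not linearly separable, so no weights/threshold exist.
maj = {xs: int(sum(xs) >= 2) for xs in product((0, 1), repeat=3)}
xor = {xs: xs[0] ^ xs[1] for xs in product((0, 1), repeat=2)}
print(is_threshold(maj, 3))   # some valid (weights, T)
print(is_threshold(xor, 2))   # None
```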

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-15: A BRIDGING FAULT MODEL FOR LINE COVERAGE IN THE PRESENCE OF UNDETECTED TRANSITION FAULTS
Speaker and Author:
Irith Pomeranz, Purdue University, US
Abstract
A variety of fault models have been defined to capture the behaviors of commonly occurring defects and ensure a high quality of testing. When several fault models are used for test generation, it is advantageous if the existence of an undetectable fault in one model does not imply that a fault in the same component but from a different model is also undetectable. This allows a test set to cover the circuit more thoroughly when additional fault models are used. This paper studies the possibility of defining such fault models by considering transition faults as the first fault model, and bridging faults as the second fault model. The bridging faults are defined to cover lines for which transition faults are not detected. A test compaction procedure is developed to demonstrate the bridging fault coverage that can be achieved, and the effect on the number of tests.
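
For readers unfamiliar with the fault model, a bridging fault ties two circuit lines together; under the common wired-AND interpretation both lines take the AND of their fault-free values, and a test detects the bridge when the faulty and fault-free circuits disagree at an output. The toy example below illustrates this generic behaviour only; the paper's bridging faults are defined specifically on lines whose transition faults remain undetected:

```python
def good_circuit(a, b, c, d):
    x = a & b            # line x
    y = c | d            # line y
    return x ^ y

def bridged_circuit(a, b, c, d):
    x = a & b
    y = c | d
    x = y = x & y        # wired-AND bridge drives both lines to AND(x, y)
    return x ^ y

# A test pattern detects the bridge iff the two circuits disagree at the output.
for test in [(1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 0)]:
    detected = good_circuit(*test) != bridged_circuit(*test)
    print(test, "detects the bridge" if detected else "does not detect the bridge")
```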

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-16: CHRT: A CRITICALITY- AND HETEROGENEITY-AWARE RUNTIME SYSTEM FOR TASK-PARALLEL APPLICATIONS
Speaker:
Myeonggyun Han, UNIST, KR
Authors:
Myeonggyun Han, Jinsu Park and Woongki Baek, UNIST, KR
Abstract
Heterogeneous multiprocessing (HMP) is an emerging technology for high-performance and energy-efficient computing. While task parallelism is widely used across computing domains, from embedded to machine learning, relatively little work has been done to investigate efficient runtime support that effectively exploits the criticality of the target application's tasks and the heterogeneity of the underlying HMP system with full resource management. To bridge this gap, we propose a criticality- and heterogeneity-aware runtime system for task-parallel applications (CHRT). CHRT dynamically estimates the performance and power consumption of the target task-parallel application and robustly manages the full HMP system resources (i.e., core types, counts, and voltage/frequency levels) to maximize overall efficiency. Our experimental results show that CHRT achieves significantly higher energy efficiency than a baseline runtime system that employs a breadth-first scheduler and a state-of-the-art criticality-aware runtime system.

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-17: MOBIXEN: PORTING XEN ON ANDROID DEVICES FOR MOBILE VIRTUALIZATION
Speaker:
Jianguo Yao, Shanghai Jiao Tong University, CN
Authors:
Yaozu Dong1, Jianguo Yao2, Haibing Guan2, Ananth. Krishna R1 and Yunhong Jiang1
1Intel, US; 2Shanghai Jiao Tong University, CN
Abstract
Mobile virtualization technology provides a feasible way to improve the manageability and security of embedded systems. This paper presents an architecture named MobiXen to address these challenges. In MobiXen, both Xen's physical memory space and virtual address space are shrunk as much as possible so that Android owns more memory resources; optimizations are developed to reduce the virtualization overhead when Android accesses system resources; and new policies are implemented to achieve low suspend/resume latency. With these techniques adopted, MobiXen becomes a highly efficient mobile hypervisor. The detailed implementation shows that most of the performance degradation introduced by MobiXen is less than 3%, which is imperceptible to end users.

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-18: OPTIMISATION OPPORTUNITIES AND EVALUATION FOR GPGPU APPLICATIONS ON LOW-END MOBILE GPUS
Speaker:
Leonidas Kosmidis, Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES
Authors:
Matina Maria Trompouki1 and Leonidas Kosmidis2
1Universitat Politècnica de Catalunya, ES; 2Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES
Abstract
Previous works in the literature have shown the feasibility of general purpose computations for non-visual applications on low-end mobile graphics processors using graphics APIs. These works focused only on the functional aspects of the software, ignoring the implementation details and therefore their performance implications due to their particular micro-architecture. Since various steps in such applications can be implemented in multiple ways, we identify optimisation opportunities, explore the different options and evaluate them. We show that the implementation details can significantly affect the obtained performance with discrepancies up to 3 orders of magnitude and we demonstrate the effectiveness of our proposal on two embedded platforms, obtaining more than 16x speedup over benchmarks designed following OpenGL ES 2 best practices.

Download Paper (PDF; Only available from the DATE venue WiFi)