IP3 Interactive Presentations


Date: Wednesday 11 March 2015
Time: 16:00 - 16:30
Location / Room: Exhibition Area

Interactive Presentations run simultaneously during a 30-minute slot. A poster associated with each IP paper is on display throughout the afternoon. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session, prior to the actual Interactive Presentation. At the end of each afternoon's Interactive Presentations session, the 'Best IP of the Day' award is given.

Label Presentation Title
Authors
IP3-1 STT MRAM-BASED PUFS
Speakers:
Elena Ioana Vatajelu¹, Giorgio Di Natale², Marco Indaco¹ and Paolo Prinetto¹
¹Politecnico di Torino, IT; ²LIRMM, FR
Abstract
Physical Unclonable Functions (PUFs) are emerging cryptographic primitives used to implement low-cost device authentication and secure secret key generation. Weak PUFs (i.e., devices able to generate a single signature or able to deal with a limited number of challenges) are widely discussed in the literature. Nowadays, the most promising solution is based on SRAMs. In this paper we propose an innovative PUF design based on STT-MRAM memory. We exploit the high variability affecting the electrical resistance of the MTJ device in anti-parallel magnetization. We will show that the proposed solution is robust, unclonable and unpredictable.
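
As a rough illustration of how resistance variability could be turned into a device signature, the Python sketch below derives one response bit from each pair of neighbouring cells. The abstract does not describe the bit-extraction scheme, so the pairwise comparison, the nominal resistance, the relative spread and all function names are assumptions made purely for illustration.

```python
import random


def make_device(num_cells, seed, nominal_ohms=6000.0, rel_sigma=0.05):
    """Model one chip as a list of cell resistances (hypothetical values).

    The seed stands in for the manufacturing variation of one device.
    """
    rng = random.Random(seed)
    return [rng.gauss(nominal_ohms, rel_sigma * nominal_ohms)
            for _ in range(num_cells)]


def puf_response(resistances):
    """Derive one bit per pair of neighbouring cells (assumed scheme)."""
    return [1 if resistances[i] > resistances[i + 1] else 0
            for i in range(0, len(resistances) - 1, 2)]


def hamming_fraction(bits_a, bits_b):
    """Fraction of positions in which two equally long responses differ."""
    return sum(x != y for x, y in zip(bits_a, bits_b)) / len(bits_a)
```

In such a model, two devices built with different seeds would be expected to differ in roughly half of their response bits, while the same device always reproduces its own response.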

Download Paper (PDF; Only available from the DATE venue WiFi)
IP3-2 SPATIAL AND TEMPORAL GRANULARITY LIMITS OF BODY BIASING IN UTBB-FDSOI
Speakers:
Johannes Maximilian Kühn¹, Dustin Peterson¹, Hideharu Amano², Oliver Bringmann¹ and Wolfgang Rosenstiel¹
¹Eberhard Karls Universität Tübingen, DE; ²Keio University, JP
Abstract
Advances in SOI technology such as STMicro's 28 nm UTBB-FDSOI have enabled a renaissance of body biasing. Body biasing is a fast and efficient technique to change power and performance characteristics. As the electrical effort needed to change the substrate potential is small compared to Dynamic Voltage Scaling, much finer island sizes are conceivable. This, however, creates new challenges with regard to partitioning a design into body bias islands and choosing body bias combinations across such designs. These combinations should be chosen so that energy efficiency improves while timing constraints are maintained. We introduce a combination-based analysis tool to find optimized body bias island partitions and body biasing levels. For such partitions, optimized body bias assignments for static, programmable and dynamic body biasing can be computed. The overheads incurred by dynamically switching body biases are estimated to yield actual improvements and to give an upper bound for the power consumption of the required additional circuitry. Based on these partitionings and the switching overheads, optimized application-specific switching strategies are computed. The effectiveness of this method is demonstrated in a frequency scaling scenario using forward body biasing on a Dynamic Reconfigurable Processor (DRP) design. We show that leakage can be greatly reduced using the proposed methods and that dynamic body biasing can be beneficial even at small time periods.

IP3-3 A HARDWARE IMPLEMENTATION OF A RADIAL BASIS FUNCTION NEURAL NETWORK USING STOCHASTIC LOGIC
Speakers:
Yuan Ji¹, Feng Ran¹, Cong Ma² and David Lilja²
¹Shanghai University, CN; ²University of Minnesota - Twin Cities, US
Abstract
Hardware implementations of artificial neural networks typically require significant amounts of hardware resources. This paper proposes a novel radial basis function artificial neural network using stochastic computing elements, which greatly reduces the required hardware. The Gaussian function used for the radial basis function is implemented with a two-dimensional finite state machine. The norm between the input data and the center point is optimized using simple logic gates. Results from two pattern recognition case studies, the standard Iris flower and the MICR font benchmarks, show that the difference of the average mean squared error between the proposed stochastic network and the corresponding traditional deterministic network is only 1.3% when the stochastic stream length is 10 kbits. The accuracy of the recognition rate varies depending on the stream length, which gives the designer tremendous flexibility to trade off speed, power, and accuracy. From the FPGA implementation results, the hardware resource requirement of the proposed stochastic hidden neuron is only a few percent of the hardware requirement of the corresponding deterministic hidden neuron. The proposed stochastic network can be expanded to larger scale networks for complex tasks with simple hardware architectures.
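
To illustrate why stochastic computing needs so little logic and why accuracy depends on stream length, the Python sketch below shows the textbook unipolar encoding, in which a value in [0, 1] is the probability of a 1 in a bit stream and a single AND gate multiplies two independent streams. This is a generic stochastic-logic example, not the finite-state-machine Gaussian of the paper; the function names and the seed parameter are hypothetical.

```python
import random


def to_stream(value, length, rng):
    """Encode a value in [0, 1] as a bit stream with P(bit = 1) = value."""
    return [1 if rng.random() < value else 0 for _ in range(length)]


def from_stream(bits):
    """Decode a stream by counting the fraction of 1s."""
    return sum(bits) / len(bits)


def stochastic_multiply(x, y, length, seed=0):
    """Multiply x and y by AND-ing two independently generated streams."""
    rng = random.Random(seed)
    xs = to_stream(x, length, rng)
    ys = to_stream(y, length, rng)
    return from_stream([a & b for a, b in zip(xs, ys)])
```

Calling `stochastic_multiply(0.5, 0.4, 10000)` should return a value close to 0.2, and shorter streams give coarser estimates, which is the speed versus accuracy trade-off mentioned in the abstract.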

IP3-4 SODA: SOFTWARE DEFINED FPGA BASED ACCELERATORS FOR BIG DATA
Speakers:
Chao Wang, Xi Li and Xuehai Zhou, University of Science and Technology of China, CN
Abstract
FPGAs have been an emerging topic in novel big data architectures and systems, due to their high efficiency and low power consumption. They enable researchers to deploy massive accelerators within a single chip. In this paper, we present software-defined FPGA-based accelerators for big data, named SODA, which can reconstruct and reorganize the acceleration engines according to the requirements of various data-intensive applications. SODA decomposes large and complex applications into coarse-grained single-purpose RTL code libraries that perform specialized tasks in out-of-order hardware. We built a prototyping system with Constrained Shortest Path Finding (CSPF) case studies to evaluate the SODA framework. SODA is able to achieve up to 43.75X speedup for a 128-node application. Furthermore, the hardware cost of the SODA framework demonstrates that it can achieve high speedup with moderate hardware utilization.

IP3-5 DYNAMIC RECONFIGURABLE PUNCTURING FOR SECURE WIRELESS COMMUNICATION
Speakers:
Liang Tang¹, Jude Angelo Ambrose², Akash Kumar¹ and Sri Parameswaran²
¹National University of Singapore, SG; ²University of New South Wales, AU
Abstract
The ubiquity of wireless devices has created security concerns about the information being transferred. It is critical to protect the secret information in every layer of wireless communication to thwart any type of attack. A dynamic reconfigurable puncturing based security mechanism, named RePunc, is proposed in this paper to provide an extra level of security at the physical layer. RePunc utilizes the puncturing feature of Forward Error Correction (FEC) to insert the secure information in the punctured positions of the standard information encoded data. The punctured patterns are dynamically changed and passed as a secret key from the sender to the receiver. An eavesdropper will not be able to detect the transmission of the secure information since the inserted secure information will be processed as channel noise by the eavesdropper's receiver. However, the rightful receiver will be able to successfully decode the secure packets by knowingly differentiating the secure information and the standard information before the FEC decoding. A case study of a RePunc implementation for WiFi communication is presented in this paper, showing extremely high security complexity with low hardware overhead.
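
The idea of carrying secure bits in punctured positions can be sketched in a few lines of Python. The sketch assumes a frame of hard-decision bits and a puncturing pattern given as a list of positions; the function names, the use of `None` as an erasure marker and the frame layout are hypothetical and are not taken from the paper.

```python
def embed_secure_bits(coded_bits, punctured_positions, secure_bits):
    """Sender side: overwrite the punctured positions with secure bits."""
    if len(punctured_positions) != len(secure_bits):
        raise ValueError("need exactly one secure bit per punctured position")
    frame = list(coded_bits)
    for position, bit in zip(punctured_positions, secure_bits):
        frame[position] = bit
    return frame


def split_frame(frame, punctured_positions):
    """Receiver side: separate secure bits from the standard stream.

    Punctured positions are replaced by None (an erasure) so that a
    later FEC decoder can treat them as unknown bits.
    """
    punctured = set(punctured_positions)
    secure_bits = [frame[position] for position in punctured_positions]
    standard = [None if index in punctured else bit
                for index, bit in enumerate(frame)]
    return secure_bits, standard
```

A receiver that does not know the pattern would pass the whole frame to its FEC decoder, where the inserted bits simply look like bit errors; a receiver that holds the pattern can call `split_frame` before decoding.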

IP3-6 QR-DECOMPOSITION ARCHITECTURE BASED ON TWO-VARIABLE NUMERIC FUNCTION APPROXIMATION
Speakers:
Jochen Rust, Frank Ludwig and Steffen Paul, University of Bremen, DE
Abstract
This paper presents a new approach for hardware-based QR-decomposition using an efficient computation scheme of the Givens-Rotation. In detail, the angle of rotation and its application to the Givens-Matrix are processed in a direct, straightforward manner. High-performance signal processing is achieved by piecewise approximation of the arctangent and sine functions. In order to identify appropriate function approximations, several designs with varying constraints are automatically generated and analyzed. Physical and logical synthesis is performed in a 130 nm CMOS technology. The application of our proposal in a multi-antenna mobile communication scenario shows that our approach is very efficient in terms of calculation accuracy and computation performance.
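
For readers unfamiliar with the scheme, the Python sketch below performs a QR decomposition with Givens rotations by computing the rotation angle explicitly with an arctangent and then applying its sine and cosine. It uses the exact `math` library functions where the paper uses piecewise approximations, and the function name `givens_qr` and the list-of-lists matrix format are assumptions for illustration.

```python
import math


def givens_qr(a):
    """Return (q, r) with q orthogonal, r upper triangular and q*r = a."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero column j from the bottom row up to the diagonal.
        for i in range(m - 1, j, -1):
            if r[i][j] == 0.0:
                continue
            # Angle that rotates (r[i-1][j], r[i][j]) onto the first axis.
            theta = math.atan2(r[i][j], r[i - 1][j])
            c, s = math.cos(theta), math.sin(theta)
            # Apply the rotation to rows i-1 and i of r.
            for k in range(n):
                upper, lower = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * upper + s * lower
                r[i][k] = -s * upper + c * lower
            # Accumulate the transposed rotation into columns i-1, i of q.
            for k in range(m):
                left, right = q[k][i - 1], q[k][i]
                q[k][i - 1] = c * left + s * right
                q[k][i] = -s * left + c * right
    return q, r
```

In a hardware setting, the calls to `math.atan2`, `math.cos` and `math.sin` are the places where a function approximation would be substituted.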

IP3-7 IN-PLACE MEMORY MAPPING APPROACH FOR OPTIMIZED PARALLEL HARDWARE INTERLEAVER ARCHITECTURES
Speakers:
Saeed Ur Rehman¹, Cyrille Chavet², Philippe Coussy² and Awais Sani¹
¹Lab-STICC / Université de Bretagne Sud, PK; ²Lab-STICC / Université de Bretagne Sud, FR
Abstract
Due to their impressive error correction performance, turbo-code or LDPC (Low Density Parity Check) architectures are now widely used in communication systems and are one of the most critical parts of decoders. In order to achieve high throughput requirements, these decoders are based on parallel architectures, which results in a major problem to be solved: parallel memory access conflicts. To solve these conflicts, different approaches have been proposed in the state of the art, resulting in many different architectural solutions. In this article, we introduce a new class of memory mapping approach that can solve the conflicts with an optimized architecture based on in-place memory mapping for any application.
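
What a parallel memory access conflict means can be made concrete with a small Python check. The sketch assumes that each time step lists the data addresses accessed in parallel and that a mapping assigns every address to one single-port memory bank; a conflict occurs when two accesses of the same step hit the same bank. The function name and data layout are hypothetical, and the sketch only verifies a given mapping rather than reproducing the paper's mapping algorithm.

```python
def find_conflicts(access_schedule, bank_of):
    """Return (step, address_a, address_b) for every same-bank collision."""
    conflicts = []
    for step, addresses in enumerate(access_schedule):
        first_user = {}  # bank -> first address that used it in this step
        for address in addresses:
            bank = bank_of[address]
            if bank in first_user:
                conflicts.append((step, first_user[bank], address))
            else:
                first_user[bank] = address
    return conflicts


# Two processing elements access four data items, first in natural
# order and then in interleaved order (toy example).
natural = [(0, 1), (2, 3)]
interleaved = [(0, 2), (1, 3)]

modulo_mapping = {0: 0, 1: 1, 2: 0, 3: 1}
conflict_free_mapping = {0: 0, 1: 1, 2: 1, 3: 0}
```

With `modulo_mapping`, the interleaved order makes addresses 0 and 2 collide in bank 0, whereas `conflict_free_mapping` serves both access orders without any collision.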

IP3-8 MAXIMIZING COMMON IDLE TIME ON MULTI-CORE PROCESSORS WITH SHARED MEMORY
Speakers:
Chenchen Fu¹, Yingchao Zhao², Minming Li¹ and Jason Xue³
¹Department of Computer Science, City University of Hong Kong, HK; ²Department of Computer Science, Caritas Institute of Higher Education, Hong Kong, HK; ³City University of Hong Kong, HK
Abstract
Reducing energy consumption is a critical problem in most computing systems today. This paper focuses on reducing the energy consumption of the shared main memory in multi-core processors by putting it into a sleep state when all the cores are idle. Based on this idea, this work presents a systematic analysis of different assignment and scheduling models and proposes a series of scheduling schemes to maximize the common idle time of all cores. An optimal scheduling scheme is proposed assuming the number of cores is unbounded. When the number of cores is bounded, an efficient heuristic algorithm is proposed. The experimental results show that the heuristic algorithm works efficiently and can save as much as 25.6% memory energy compared to a conventional multi-core scheduling scheme.
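
The quantity being maximized can be illustrated with a short Python sketch that measures the common idle time of a given schedule, that is, the time during which no core is busy and the shared memory could sleep. It assumes each core's schedule is a list of (start, end) busy intervals inside [0, horizon]; the function name and the two toy schedules are hypothetical, and the sketch evaluates schedules rather than implementing the proposed scheduling schemes.

```python
def common_idle_time(busy_per_core, horizon):
    """Time within [0, horizon] during which every core is idle."""
    intervals = sorted(iv for core in busy_per_core for iv in core)
    busy = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # A gap: close the previous merged busy interval.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlap with the current merged interval: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return horizon - busy


# Same workload per core (5 and 3 time units), two different schedules.
scattered = [[(0, 3), (6, 8)], [(1, 4)]]
aligned = [[(0, 5)], [(0, 3)]]
```

Over a horizon of 10 time units, the scattered schedule leaves 4 units of common idle time while the aligned one leaves 5, which shows how scheduling alone changes the sleep opportunity of the shared memory.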

IP3-9 MAXIMIZING IO PERFORMANCE VIA CONFLICT REDUCTION FOR FLASH MEMORY STORAGE SYSTEMS
Speakers:
Qiao Li¹, Liang Shi², Congming Gao¹, Kaijie Wu¹, Jason Chun Xue³, Qingfeng Zhuge¹ and H.-M. Edwin Sha⁴
¹Chongqing University, CN; ²College of Computer Science, Chongqing University, CN; ³City University of Hong Kong, HK; ⁴Chongqing University and University of Texas at Dallas, CN
Abstract
Flash memory has been widely deployed in recent years with the improvement of bit density and technology scaling. However, this development trend has also introduced significant performance degradation. The latency of IO requests on flash memory storage systems is composed of access conflict latency, data transfer latency, flash chip access latency and ECC encoding/decoding latency. Studies show that the access conflict latency, which is mainly induced by the slow transfer latency and access latency, has become the dominant part of the IO latency, especially for IO-intensive applications. This paper proposes to reduce the flash access conflict latency through the reduction of the transfer and flash access latencies. A latency model is built to construct the relationship between the transfer latency and the access latency based on the reliability characteristics of flash memory. Simulation experiments show that the proposed approach achieves significant performance improvement.

IP3-10 A HYBRID PACKET/CIRCUIT-SWITCHED ROUTER TO ACCELERATE MEMORY ACCESS IN NOC-BASED CHIP MULTIPROCESSORS
Speakers:
Yassin Mazloumi and Mehdi Modarressi, University of Tehran, IR
Abstract
Modern chip multiprocessors will feature a large shared last-level cache (LLC) that is decomposed into smaller slices and physically distributed throughout the chip area. These architectures rely on a network-on-chip (NoC) to handle remote cache access and hence, NoCs play a critical role in optimizing memory access latency and power consumption. Circuit-switching is the most power- and performance-efficient switching mechanism in NoCs, but is not advantageous when the packet transmission time is not long enough compared to the circuit setup time. In this paper, we propose a zero-latency circuit setup scheme to make circuit-switching applicable to transferring individual data packets. The design leverages the fact that in CMPs with distributed LLC (where a considerable portion of the on-chip traffic is composed of remote LLC access requests and data responses), every response packet is sent in reply to a request packet and traverses the same path as its corresponding request, but in the backward direction. The short request packets, then, are responsible for reserving a path for their corresponding response packets. This NoC tries to reduce conflict among circuit paths by considering conflicts in the backward direction during request packet routing, backed by a run-time technique to resolve conflicts when circuits are actually set up. Experimental results show that the proposed NoC architecture considerably reduces average packet latency, which directly translates to faster memory access.

IP3-11 SEMIAUTOMATIC IMPLEMENTATION OF A BIOINSPIRED RELIABLE ANALOG TASK DISTRIBUTION ARCHITECTURE FOR MULTIPLE ANALOG CORES
Speakers:
Julius von Rosen¹, Markus Meissner¹ and Lars Hedrich²
¹Goethe Universität Frankfurt, DE; ²Goethe-Universität Frankfurt a. M., DE
Abstract
In this paper we present a silicon implementation of a bioinspired analog task distribution system for enabling reliable analog multi-core systems. The increase in reliability is achieved by a dependable task distribution architecture using a hormone-based mechanism. The specifications are generated by a feasibility analysis of the algebraic description of the architecture. Starting from the specifications, an automated analog synthesis framework is used to speed up the time-consuming design of the needed analog amplifiers. The complete system with the designed amplifiers has been laid out and fabricated. We present measurements of two different architectures of the task distribution system on silicon, showing the full functionality of the system and the design methodology.

IP3-12 POWER-EFFICIENT ACCELERATOR ALLOCATION IN ADAPTIVE DARK SILICON MANY-CORE SYSTEMS
Speakers:
Muhammad Usman Karim Khan, Muhammad Shafique and Joerg Henkel, Karlsruhe Institute of Technology (KIT), DE
Abstract
Modern many-core systems in the dark silicon era face the predicament of underutilized chip resources due to power constraints. Therefore, hardware accelerators are becoming popular, as they can overcome this problem by executing a part of the program on dedicated custom logic in an energy-efficient way. However, efficient accelerator usage poses numerous challenges, such as adapting the accelerator's sharing schedule on many-core systems under scenarios that vary at run time. In this work, we propose a power-efficient accelerator allocation scheme for adaptive many-core systems that maximally utilizes and dynamically allocates a shared accelerator to competing cores, such that deadlines of the executing applications are met and the total power consumption of the overall system is minimized. The experimental results demonstrate power minimization and high accelerator utilization for a many-core system.

IP3-13 THERMAL-AWARE FLOORPLANNING FOR PARTIALLY-RECONFIGURABLE FPGA-BASED SYSTEMS
Speakers:
Davide Pagano, Mikel Vuka, Marco Rabozzi, Riccardo Cattaneo, Donatella Sciuto and Marco D. Santambrogio, Politecnico di Milano, IT
Abstract
Field Programmable Gate Array (FPGA) systems are increasingly common in high-performance applications. Temperature affects both reliability and performance, so its optimization has become a challenge for system designers. In this work we present a novel thermal-aware floorplanner based on both Simulated Annealing (SA) and Mixed-Integer Linear Programming (MILP). The proposed method takes into account an accurate description of the heterogeneous resources and partial reconfiguration constraints of recent FPGAs. Our major contribution is to provide a high-level formulation of the problem, without resorting to low-level considerations about FPGA resources. Within our approach we combine the benefits of SA and MILP to handle both linear and non-linear optimization metrics while providing an effective exploration of the solution space. Experimental results show that, for several designs, it is possible to reduce the peak temperature by taking into account power consumption during the floorplanning stage.

IP3-14 FEEDBACK-BUS OSCILLATION RING: A GENERAL ARCHITECTURE FOR DELAY CHARACTERIZATION AND TEST OF INTERCONNECTS
Speakers:
Shi-Yu Huang¹, Meng-Ting Tsai¹, Kun-Han Tsai² and Wu-Tung Cheng²
¹National Tsing Hua University, TW; ²Mentor, US
Abstract
In this paper we propose a flexible delay characterization and test method for arbitrary die-to-die interconnects in a 3D IC. Compared to previous works, it is unique in its ability to streamline the characterization/test operations for a set of arbitrary interconnects with multiple pins spanning multiple dies. During the Design-for-Testability stage, one common feedback-bus (connected to all dies in the IC under characterization/test) is inserted. Through the feedback-bus, an oscillation ring can be formed dynamically and the Variable-Output-Threshold (VOT) technique can be applied to characterize the delay of one selected interconnect segment at a time. Experimental results indicate that this method is not only flexible and scalable, but also requires only a small area overhead.

IP3-15 ANALOG NEUROMORPHIC COMPUTING ENABLED BY MULTI-GATE PROGRAMMABLE RESISTIVE DEVICES
Speakers:
Vehbi Calayir, Mohamed Darwish, Jeffrey Weldon and Larry Pileggi, Carnegie Mellon University, US
Abstract
Analog neural networks represent a massively parallel computing paradigm by mimicking the human brain. Two important functions that are not efficiently built by CMOS technology for their practical hardware implementations are weighting for synapse circuits and summing for neuron circuits. In this paper we propose the use of tunable analog resistances, such as multi-gate graphene devices, to efficiently enable these two functions. We design and demonstrate complete analog neuromorphic circuitry enabled by such devices. Simulation results based on Verilog-A compact models for graphene devices confirm its functionality. We also provide an experimental demonstration of our proposed graphene device along with projected circuit performance based on scaling targets. Our proposed design is suitable not only for the device example shown in this paper, but also for any beyond-CMOS technology that exhibits similar device characteristics.

IP3-16 AN ENERGY-EFFICIENT NON-VOLATILE IN-MEMORY ACCELERATOR FOR SPARSE-REPRESENTATION BASED FACE RECOGNITION
Speakers:
Yuhao Wang¹, Hantao Huang¹, Leibin Ni¹, Hao Yu¹, Mei Yan¹, Chuliang Weng², Wei Yang² and Junfeng Zhao²
¹Nanyang Technological University, SG; ²Shannon Laboratory, Huawei Technologies Co., Ltd, CN
Abstract
Data analytics such as face recognition involve large volumes of image data, and hence pose a grand challenge for mobile platform design with strict power requirements. Emerging non-volatile STT-MRAM has minimal leakage power and speed comparable to SRAM, and hence is considered a promising candidate for data-oriented mobile computing. However, STT-MRAM has significantly higher write energy than SRAM. Based on the use of STT-MRAM, this paper introduces an energy-efficient non-volatile in-memory accelerator for a sparse-representation based face recognition algorithm. We find that by projecting high-dimension image data to a much lower dimension, current scaling for the STT-MRAM write operation can be applied aggressively, which leads to significant power reduction yet maintains quality-of-service for face recognition. Specifically, compared to a baseline with SRAM, leakage power and dynamic power are reduced by 91.4% and 79% respectively, with only a slight compromise on recognition rate.
