8.4 Efficient and reliable memory and computing architectures


Time

Label

Presentation Title
Authors

17:00

8.4.1

(Best Paper Award Candidate)
HYVE: HYBRID VERTEX-EDGE MEMORY HIERARCHY FOR ENERGY-EFFICIENT GRAPH PROCESSING
Speaker:
Tianhao Huang, Tsinghua University, CN
Authors:
Tianhao Huang, Guohao Dai, Yu Wang and Huazhong Yang, Tsinghua University, CN
Abstract
High energy consumption of conventional memory modules (e.g. DRAMs) hinders further improvement of the energy efficiency of large-scale graph processing. Emerging metal-oxide resistive random-access memory (ReRAM) and ReRAM crossbars have shown great potential for providing energy-efficient memory modules. However, the performance of ReRAMs suffers from data access patterns with poor locality and from large amounts of written data, both of which are common in graph processing. In this paper, we propose a Hybrid Vertex-Edge memory hierarchy, HyVE, to avoid random accesses and data writes to ReRAM modules. With data allocation and scheduling over vertices and edges, HyVE reduces memory energy consumption by 69% compared with a conventional memory system in graph processing. Moreover, we adopt a bank-level power-gating scheme to further reduce the standby power. Our evaluations show that the optimized design achieves at least a 2.0x improvement in energy efficiency over a DRAM-based design.

17:30

8.4.2

ACCURATE NEURON RESILIENCE PREDICTION FOR A FLEXIBLE RELIABILITY MANAGEMENT IN NEURAL NETWORK ACCELERATORS
Speaker:
Christoph Schorn, Robert Bosch GmbH, DE
Authors:
Christoph Schorn¹, Andre Guntoro¹ and Gerd Ascheid²
¹Robert Bosch GmbH, DE; ²RWTH Aachen University, DE
Abstract
Deep neural networks have become a ubiquitous tool for mastering complex classification tasks. Current research focuses on the development of power-efficient and fast neural network hardware accelerators for mobile and embedded devices. However, when used in safety-critical applications, for example autonomously operating vehicles, the reliability of such accelerators becomes a further optimization criterion, which can conflict with power efficiency and latency. Furthermore, ensuring hardware reliability becomes increasingly challenging with shrinking structure widths and rising power densities in the nanometer semiconductor technology era. One solution to this challenge is the exploitation of fault-tolerant parts in deep neural networks. In this paper, we propose a new method for predicting the error resilience of neurons in deep neural networks and show that this method significantly improves upon existing methods in terms of accuracy as well as interpretability. We evaluate prediction accuracy by simulating hardware faults in networks trained on the CIFAR-10 and ILSVRC image classification benchmarks and protecting neurons according to the resilience estimations. In addition, we demonstrate how our resilience prediction can be used for a flexible trade-off between reliability and efficiency in neural network hardware accelerators.

18:00

8.4.3

RAPID IN-MEMORY MATRIX MULTIPLICATION USING ASSOCIATIVE PROCESSOR
Speaker:
Hasan Erdem Yantir, University of California, Irvine, US
Authors:
Neggaz Mohamed Ayoub¹, Hasan Erdem Yantır², Smail Niar³, Ahmed Eltawil² and Fadi Kurdahi²
¹University of Valenciennes, FR; ²University of California, Irvine, US; ³LAMIH-University of Valenciennes, FR
Abstract
Memory hierarchy latency is one of the main problems that prevent processors from achieving high performance. To eliminate the need for loading and storing large sets of data, Resistive Associative Processors (ReAP) have been proposed as a solution to the von Neumann bottleneck. In ReAPs, logic and memory structures are combined to allow in-memory computation. In this paper, we propose a new algorithm for computing matrix multiplication inside the memory that exploits the benefits of ReAP. The proposed approach is based on the Cannon algorithm and uses a series of rotations without duplicating the data. It runs in O(n), where n is the dimension of the matrix. The method also applies to a large set of row-by-column matrix-based applications. Experimental results show an increase in performance of several orders of magnitude and a reduction in energy and area when compared to the latest FPGA and CPU implementations.
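For readers unfamiliar with the Cannon algorithm mentioned in the abstract, the following is a minimal software sketch of its rotation scheme. It is not the authors' ReAP implementation: the function name cannon_matmul is hypothetical, and the in-memory rotations are only modeled here by rebuilding Python lists. After an initial skew, each of the n steps performs one element-wise multiply-accumulate over the whole grid followed by one rotation, which is where the O(n) step count comes from when all n x n cells work in parallel.

```python
def cannon_matmul(a, b):
    """Multiply two n x n matrices using Cannon-style rotations.

    Illustrative sketch only: the rotations are modeled with list
    rebuilding, whereas in-memory hardware would shift data in place.
    """
    n = len(a)
    # Initial skew: row i of A rotates left by i positions,
    # column j of B rotates up by j positions.
    a_s = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b_s = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # One parallel step: every cell multiplies its local operands.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_s[i][j] * b_s[i][j]
        # Rotate every row of A left by one and every column of B up by one.
        a_s = [[a_s[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b_s = [[b_s[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


if __name__ == "__main__":
    print(cannon_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

At step k, cell (i, j) holds a[i][m] and b[m][j] with m = (i + j + k) mod n, so after n steps every index m has contributed exactly once to c[i][j].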

18:15

8.4.4

HIMAP: A HIERARCHICAL MAPPING APPROACH FOR ENHANCING LIFETIME RELIABILITY OF DARK SILICON MANYCORE SYSTEMS
Speaker:
Vivek Chaturvedi, Nanyang Technological University, SG
Authors:
Vijeta Rathore¹, Vivek Chaturvedi¹, Amit Kumar Singh², Thambipillai Srikanthan¹, Rohith R¹, Siew Kei Lam¹ and Muhammad Shafique³
¹Nanyang Technological University, SG; ²University of Essex, GB; ³TU Wien, AT
Abstract
Technology scaling into the nano-scale CMOS regime has resulted in increased leakage and a roadblock to voltage scaling, which has led to several issues such as high power density and elevated on-chip temperature. This in turn aggravates device aging, compromising the lifetime reliability of manycore systems. This paper proposes HiMap, a dynamic hierarchical mapping approach to maximize lifetime reliability of manycore systems while satisfying performance, power, and thermal constraints. HiMap is process variation- and aging-aware. It comprises two levels: (1) it identifies a region of cores suitable for mapping, and (2) it maps threads in the region and intersperses dark cores for thermal mitigation while considering the current health of the cores. Both levels strive to reduce aging variance across the chip. We evaluated HiMap for 64-core and 256-core systems. Results demonstrate an improved system lifetime reliability by up to 2 years at the end of 3.25 years of use, as compared to the state-of-the-art.

18:30

IP3-16, 906

DEMAS: AN EFFICIENT DESIGN METHODOLOGY FOR BUILDING APPROXIMATE ADDERS FOR FPGA-BASED SYSTEMS
Speaker:
Semeen Rehman, Vienna University of Technology (TU Wien), AT
Authors:
Bharath Srinivas Prabakaran¹, Semeen Rehman¹, Muhammad Abdullah Hanif¹, Salim Ullah², Ghazal Mazaheri³, Akash Kumar² and Muhammad Shafique¹
¹TU Wien, AT; ²Technische Universität Dresden, DE; ³UC Riverside, US
Abstract
The current state-of-the-art approximate adders are mostly ASIC-based, i.e., they focus solely on gate and/or transistor level approximations (e.g., through circuit simplification or truncation) to achieve area, latency, power and/or energy savings at the cost of accuracy loss. However, when these designs are synthesized for FPGA-based systems, they do not offer similar reductions in area, latency and power/energy due to the underlying architectural differences between ASICs and FPGAs. In this paper, we present a novel generic design methodology to synthesize and implement approximate adders for any FPGA-based system by considering the underlying resources and architectural differences. Using our methodology, we have designed, analyzed and presented eight different multi-bit adder architectures. Compared to the 16-bit accurate adder, our designs are successful in achieving area, latency and power-delay product gains of 50%, 38%, and 53%, respectively. We also compare our approximate adders to state-of-the-art approximate adders specialized for ASIC and FPGA fabrics and demonstrate the benefits of our approach. We will make the RTL and behavioral models of our and state-of-the-art designs open-source at https://sourceforge.net/projects/approxfpgas/ to further fuel the research and development in the FPGA community and to ensure reproducible research.
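As background on what an approximate adder trades away, the sketch below models a simplified lower-part-OR style adder in software. This is a generic, well-known approximation idea and not one of the DeMAS architectures; the function name loa_add and the parameters width and approx_bits are hypothetical. The low approx_bits bits are combined with a bitwise OR instead of a carry chain, and no carry is propagated into the exactly added upper part.

```python
def loa_add(x, y, width=16, approx_bits=4):
    """Approximate addition: OR the low bits, add the high bits exactly.

    Simplified illustration (assumption): no carry is forwarded from the
    approximate lower part into the exact upper part.
    """
    low_mask = (1 << approx_bits) - 1
    low = (x | y) & low_mask                    # cheap, carry-free lower part
    high = ((x >> approx_bits) + (y >> approx_bits)) << approx_bits
    return (high | low) & ((1 << (width + 1)) - 1)  # keep width + 1 result bits


if __name__ == "__main__":
    x, y = 0b0101, 0b0011
    print("exact:", x + y, "approximate:", loa_add(x, y))
```

For the inputs shown, the exact sum is 8 while the approximate result is 7, which illustrates the accuracy cost paid for dropping the lower carry chain.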

18:31

IP3-17, 515

GAIN SCHEDULED CONTROL FOR NONLINEAR POWER MANAGEMENT IN CMPS
Speaker:
Nikil Dutt, University of California, Irvine, US
Authors:
Bryan Donyanavard, Amir M. Rahmani, Tiago Muck, Kasra Moazzemi and Nikil Dutt, University of California, Irvine, US
Abstract
Dynamic voltage and frequency scaling (DVFS) is a well-established technique for power management of thermal- or energy-sensitive chip multiprocessors (CMPs). In this context, linear control theoretic solutions have been successfully implemented to control the voltage-frequency knobs. However, modern CMPs with a large range of operating frequencies and multiple voltage levels display nonlinear behavior in the relationship between frequency and power. State-of-the-art linear controllers therefore leave room for opportunity in optimizing DVFS operation. We propose a Gain Scheduled Controller (GSC) for nonlinear runtime power management of CMPs that simplifies the controller implementation of systems with varying dynamic properties by utilizing an adaptive control theoretic approach in conjunction with static linear controllers. Our design improves the stability, accuracy, settling time, and overshoot of the controller over a linear controller with minimal overhead. We implement our approach on an Exynos platform containing ARM's big.LITTLE-based heterogeneous multi-processor (HMP) and demonstrate that the system's response to changes in target power is improved by 2x while operating up to 12% more efficiently.
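The general idea of gain scheduling described in the abstract can be sketched as a set of fixed linear controllers, one of which is selected according to the current operating region. The example below is an assumption-laden illustration, not the authors' controller: the class GainScheduledPI, the helper select_gains, the frequency thresholds, and all gain values are hypothetical, and a simple PI law stands in for whatever linear controllers the paper uses.

```python
from dataclasses import dataclass


@dataclass
class PIGains:
    kp: float  # proportional gain
    ki: float  # integral gain


# Hypothetical schedule: (upper frequency bound in MHz, gains for that region).
SCHEDULE = [
    (800.0, PIGains(kp=0.8, ki=0.2)),
    (1400.0, PIGains(kp=0.5, ki=0.1)),
    (float("inf"), PIGains(kp=0.3, ki=0.05)),
]


def select_gains(freq_mhz):
    """Return the gains of the first region whose bound covers freq_mhz."""
    for upper_bound, gains in SCHEDULE:
        if freq_mhz <= upper_bound:
            return gains
    raise ValueError("schedule does not cover this frequency")


class GainScheduledPI:
    """PI controller whose gains depend on the current operating frequency."""

    def __init__(self):
        self.integral = 0.0

    def step(self, target_w, measured_w, freq_mhz, dt=1.0):
        gains = select_gains(freq_mhz)
        error = target_w - measured_w
        self.integral += error * dt
        # Control output, e.g. a requested change to the frequency setting.
        return gains.kp * error + gains.ki * self.integral


if __name__ == "__main__":
    controller = GainScheduledPI()
    print(controller.step(target_w=2.0, measured_w=1.0, freq_mhz=600.0))
```

Each region behaves like an ordinary linear controller, and only the lookup in select_gains adapts to the nonlinear relationship between frequency and power.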

18:32

IP4-1, 376

EFFICIENT MAPPING OF QUANTUM CIRCUITS TO THE IBM QX ARCHITECTURES
Speaker:
Alwin Zulehner, Johannes Kepler University Linz, AT
Authors:
Alwin Zulehner, Alexandru Paler and Robert Wille, Johannes Kepler University Linz, AT
Abstract
In March 2017, IBM launched the project IBM Q with the goal of providing access to quantum computers for a broad audience. This allowed users to conduct quantum experiments on a 5-qubit and, since June 2017, also on a 16-qubit quantum computer (called IBM QX2 and IBM QX3, respectively). In order to use these, the desired quantum functionality (e.g. provided in terms of a quantum circuit) has to be properly mapped so that the underlying physical constraints are satisfied - a complex task. This demands solutions that conduct this mapping process automatically and efficiently. In this paper, we propose such an approach, which satisfies all constraints given by the architecture and, at the same time, aims to keep the overhead in terms of additionally required quantum gates minimal. The proposed approach is generic and can easily be configured for future architectures. Experimental evaluations show that the proposed approach clearly outperforms IBM's own mapping solution with respect to runtime as well as resulting costs.

18:30

End of session