Date: Tuesday 28 March 2017
Time: 14:30 - 16:00
Location / Room: 5A
Chair:
Jaume Abella, Barcelona Supercomputing Center (BSC), ES
Co-Chair:
Maria K. Michael, University of Cyprus, CY
Papers in this session provide new solutions for dealing with hardware faults and metastability issues, including test and diagnosis mechanisms for NoCs, fault recovery approaches for 3D ICs, and containment solutions for metastability in sorting networks.
Time | Label | Presentation Title, Speaker, Authors, and Abstract |
---|---|---|
14:30 | 3.6.1 | CHARKA: A RELIABILITY-AWARE TEST SCHEME FOR DIAGNOSIS OF CHANNEL SHORTS BEYOND MESH NOCS. Speaker: Santosh Biswas, IIT Guwahati, IN. Authors: Biswajit Bhowmik1, Jatindra Kumar Deka1 and Santosh Biswas2; 1IIT Guwahati, IN; 2IIT Guwahati, IN. Abstract: This paper presents a fast and low-cost on-line scheme named "Charka" that analyzes short faults in channels of octagon NoCs. Experimental results demonstrate that the proposed scheme achieves 100% coverage metrics, and its on-line evaluation reveals a compelling effect of these faults on system performance. We observe that the proposed scheme is up to 9X faster, while packet latency is improved by 13.79-21.17% and energy consumption is reduced by 17.57-24.97%. Further, the test area overhead is reduced by 13-26%, which shows a 52-57.77% improvement. |
15:00 | 3.6.2 | RECOVERY-AWARE PROACTIVE TSV REPAIR FOR ELECTROMIGRATION IN 3D ICS. Speaker: Shengcheng Wang, Chair of Dependable Nano Computing (CDNC), Karlsruhe Institute of Technology (KIT), DE. Authors: Shengcheng Wang1, Hengyang Zhao2, Sheldon Tan3 and Mehdi Tahoori1; 1Karlsruhe Institute of Technology, DE; 2University of California, Riverside, US; 3University of California at Riverside, US. Abstract: Electromigration (EM) has become a major reliability concern in three-dimensional integrated circuits (3D ICs). To mitigate this problem, a typical solution is to use TSV redundancy in a reactive manner, maintaining the operability of a 3D chip in the presence of EM failures by detecting faulty TSVs and replacing them with spares. In this work, we explore an alternative, preferable approach to enhance the EM-related lifetime reliability of the TSV grid, in which redundancy is used proactively to allow non-faulty TSVs to be temporarily deactivated. In this way, EM wear-out can be reversed by exploiting its recovery property. Applied to 3D benchmark designs, the recovery-aware proactive repair approach increases the EM-related lifetime reliability (measured in mean-time-to-failure) of the entire TSV grid by up to 12X relative to the conventional reactive method, with less area overhead. |
15:30 | 3.6.3 | NEAR-OPTIMAL METASTABILITY-CONTAINING SORTING NETWORKS. Speaker: Johannes Bund, Saarland University, DE. Authors: Johannes Bund1, Christoph Lenzen2 and Moti Medina2; 1Saarland University, DE; 2MPI-INF, DE. Abstract: Metastability in digital circuits is a spurious mode of operation induced by violation of setup/hold times of stateful components. It cannot be avoided deterministically when transitioning from continuously-valued to (discrete) binary signals. However, prior work (Lenzen & Medina, ASYNC 2016) has shown that it is possible to fully and deterministically contain the effect of metastability in sorting networks. More specifically, the sorting operation incurs no loss of precision, i.e., any inaccuracy of the output originates from mapping the continuous input range to a finite domain. The downside of this prior result is inefficiency: for B-bit inputs, the circuit for a single comparison contains Theta(B^2) gates and has depth Theta(B). In this work, we present an improved solution with near-optimal Theta(B log B) gates and asymptotically optimal Theta(log B) depth. On the practical side, our sorting networks improve over prior work for all input lengths B > 2; e.g., for 16-bit inputs we present an improvement of more than 70% w.r.t. the depth of the sorting network and more than 60% w.r.t. its cost. |
16:00 | IP1-16, 267 | 3DFAR: A THREE-DIMENSIONAL FABRIC FOR RELIABLE MULTICORE PROCESSORS. Speaker: Valeria Bertacco, University of Michigan, US. Authors: Javad Bagherzadeh and Valeria Bertacco, University of Michigan, US. Abstract: In the past decade, silicon technology trends into the nanometer regime have led to significantly higher transistor failure rates. Moreover, these trends are expected to worsen with future devices. To enhance reliability, several approaches leverage the inherent core-level and processor-level redundancy present in large chip multiprocessors. However, all of these methods incur high overheads, making them impractical. In this paper, we propose 3DFAR, a novel architecture leveraging 3-dimensional fabric layouts to efficiently enhance reliability in the presence of faults. Our key idea is based on a fine-grained reconfigurable pipeline for multicore processors, which minimizes routing delay among spare units of the same type by using physical layout locality and efficient interconnect switches, distributed over multiple vertical layers. Our evaluation shows that 3DFAR outperforms state-of-the-art reliable 2D solutions, at a minimal area cost of only 7% over an unprotected design. |
16:01 | IP1-17, 933 | EVALUATING IMPACT OF HUMAN ERRORS ON THE AVAILABILITY OF DATA STORAGE SYSTEMS. Speaker: Hossein Asadi, Sharif University of Technology, IR. Authors: Mostafa Kishani, Reza Eftekhari and Hossein Asadi, Sharif University of Technology, IR. Abstract: In this paper, we investigate the effect of incorrect disk replacement service on the availability of data storage systems. To this end, we first conduct Monte Carlo simulations to evaluate the availability of the disk subsystem by considering disk failures and incorrect disk replacement service. We also propose a Markov model that corroborates the Monte Carlo simulation results. We further extend the proposed model to consider the effect of an automatic disk fail-over policy. The results obtained by the proposed model show that overlooking the impact of incorrect disk replacement can lead to underestimating unavailability by up to three orders of magnitude. Moreover, this study suggests that, once the effect of human errors is considered, the conventional beliefs about the dependability of different RAID mechanisms should be revised. The results show that in the presence of human errors, RAID1 can result in lower availability compared to RAID5. |
16:00 | | End of session. Coffee Break in Exhibition Area. On all conference days (Tuesday, March 28, 2017; Wednesday, March 29, 2017; Thursday, March 30, 2017), coffee and tea will be served during the coffee breaks in the exhibition area. |