4.6 Online Testing and Reliable Memories


Date: Tuesday 10 March 2015
Time: 17:00 - 18:30
Location / Room: Bayard

Chair:
Mihalis Psarakis, University of Piraeus, GR

Co-Chair:
Cristiana Bolchini, Politecnico di Milano, IT

Temperature- and power-aware solutions are proposed for self- and on-line testing, together with innovative fault detection and reconfiguration schemes for caches and emerging memory technologies.

Time  Label  Presentation Title
Authors
17:00  4.6.1  A DEFECT-AWARE RECONFIGURABLE CACHE ARCHITECTURE FOR LOW-VCCMIN DVFS-ENABLED SYSTEMS
Speakers:
Michail Mavropoulos, Georgios Keramidas and Dimitris Nikolos, University of Patras, GR
Abstract
As process technology continues to shrink, a large number of bitcells in on-chip caches is expected to be faulty due to manufacturing defects and process variations. The number of defective cells varies from die to die and wafer to wafer, and in the field it depends on the run-time operating conditions (e.g., supply voltage and frequency). These trends make it necessary i) to study fault-tolerant (FT) cache mechanisms across a wide spectrum of fault probabilities and ii) to devise FT cache techniques that can adapt their fault-tolerance capacity to the volume of defective locations in the target faulty caches. It is well known that, keeping the cache capacity, block size and volume of defective cells constant, the average number of misses due to faulty cells in general decreases as the associativity of the cache increases. To this end we propose DARCA, a Defect-Aware Reconfigurable Cache Architecture, which can dynamically vary its associativity according to the volume of defective cells. To keep the hardware overhead very small, as the associativity of the cache is multiplied by a power of two, its block size is divided by the same number. Since almost all contemporary processors use prefetching, we also applied DARCA to prefetch-assisted caches. By performing cycle-accurate simulations for the SPEC2006 benchmark suite, assuming a plethora of fault maps and a wide range of fault probabilities, we showed that DARCA compares favorably against several known FT cache mechanisms with respect to the performance loss caused by defective cells.
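
The trade-off between associativity and block size described in the abstract can be illustrated with plain cache arithmetic. The sketch below is not taken from the paper: the function names and the example geometry (32 KiB, 64-byte blocks, 4 ways) are assumptions chosen for illustration. It shows that when the associativity is multiplied by a power of two and the block size is divided by the same factor, the capacity and the number of sets are unchanged.

```python
def num_sets(capacity_bytes, block_bytes, ways):
    """Number of sets in a set-associative cache of the given geometry."""
    sets, remainder = divmod(capacity_bytes, block_bytes * ways)
    if remainder:
        raise ValueError("capacity must be a multiple of block_bytes * ways")
    return sets


def reconfigure(block_bytes, ways, factor):
    """Multiply the associativity by `factor` and divide the block size by it.

    `factor` must be a power of two, mirroring the rule stated in the abstract.
    This is a hypothetical helper, not the DARCA hardware mechanism itself.
    """
    if factor < 1 or factor & (factor - 1):
        raise ValueError("factor must be a power of two")
    if block_bytes % factor:
        raise ValueError("block size must be divisible by factor")
    return block_bytes // factor, ways * factor


# Assumed example geometry: 32 KiB cache, 64-byte blocks, 4 ways.
capacity = 32 * 1024
block, ways = 64, 4
new_block, new_ways = reconfigure(block, ways, 2)

print(num_sets(capacity, block, ways), num_sets(capacity, new_block, new_ways))
```

Because the product of block size and associativity is preserved, the set count is the same before and after reconfiguration in this toy model.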

17:30  4.6.2  TEMPERATURE-AWARE SOFTWARE-BASED SELF-TESTING FOR DELAY FAULTS
Speakers:
Ying Zhang1, Zebo Peng2, Jianhui Jiang3, Huawei Li4 and Masahiro Fujita5
1Tongji University, Shanghai, China, CN; 2Embedded Systems Lab, Linköping University, SE; 3School of Software Engineering, Tongji University, CN; 4Institute of Computing Technology, Chinese Academy of Sciences, CN; 5VLSI Design and Education Center, University of Tokyo, JP
Abstract
Delay defects under high temperature are one of the most critical factors affecting the reliability of computer systems, and current test methods do not address this problem properly. In this paper, a temperature-aware software-based self-testing (SBST) technique is proposed to self-heat the processors to within a high temperature range and effectively test delay faults under high temperature. First, it automatically generates high-quality test programs through automatic test instruction generation (ATIG), and avoids over-testing caused by nonfunctional patterns. Second, it exploits two effective power-intensive program transformations to heat up the processors internally. Third, it applies a greedy algorithm to search for an optimized schedule of the test templates in order to generate the test program while making sure that the temperature of the processor under test stays within the specified range. Experimental results show that the generated program successfully guarantees delay testing within the given temperature range, and achieves high test performance with functional patterns.
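
The third step, greedy scheduling of test templates under a temperature constraint, can be sketched in a few lines. The abstract does not give the paper's actual algorithm or thermal model, so everything below is an assumption made for illustration: each template is given a fixed temperature change, the processor temperature is the running sum, and the greedy rule picks the remaining template that keeps the temperature closest to the middle of the allowed range.

```python
def greedy_schedule(deltas, start_temp, t_min, t_max):
    """Order test templates so the temperature stays within [t_min, t_max].

    deltas: dict mapping a (hypothetical) template name to the temperature
            change it causes in this toy model.
    Returns (order, temperatures after each template).
    """
    midpoint = (t_min + t_max) / 2
    remaining = dict(deltas)
    temp = start_temp
    order, temps = [], []
    while remaining:
        # Candidates that keep the temperature inside the allowed range.
        feasible = [
            name for name in sorted(remaining)
            if t_min <= temp + remaining[name] <= t_max
        ]
        if not feasible:
            raise RuntimeError("no remaining template keeps the temperature in range")
        # Greedy choice: stay as close as possible to the middle of the range.
        best = min(feasible, key=lambda n: abs(temp + remaining[n] - midpoint))
        temp += remaining.pop(best)
        order.append(best)
        temps.append(temp)
    return order, temps


# Assumed template names and temperature deltas (degrees, toy values).
templates = {"t_add": 4, "t_mul": 6, "t_nop": -5, "t_load": -3}
order, temps = greedy_schedule(templates, start_temp=80, t_min=75, t_max=90)
print(order, temps)
```

A greedy rule like this makes one locally best choice per step and never backtracks, so it can fail on inputs where a feasible ordering exists; the sketch raises an error in that case rather than searching further.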

18:00  4.6.3  OPERATIONAL FAULT DETECTION AND MONITORING OF A MEMRISTOR-BASED LUT
Speakers:
Nandha kumar Thulasiraman1, Haider A.F. Almurib1 and Fabrizio Lombardi2
1The University of Nottingham, MY; 2Northeastern University, US
Abstract
This paper presents a method for operational testing of a memristor-based memory look-up table (LUT). In the proposed method, the deterioration of the memristors (as storage elements of a LUT) is modeled based on the reduction of the resistance range as observed in fabricated devices and recently reported in the technical literature. A quiescent current technique is used for testing the memristors when deterioration results in a change of state, thus leading to an erroneous (faulty) operation. An equivalent circuit model of the operational deterioration for a memristor-based LUT is presented. In addition to modeling and testing, the proposed method can also be used for continuous monitoring of the LUT in the presence of memristor deterioration. The proposed method is assessed using LTSPICE; extensive simulation results are presented with respect to different operational features, such as LUT dimension and range of resistance. These results show that the proposed test method is scalable with LUT dimension and highly efficient for testing and monitoring a LUT in the presence of multiple deteriorating memristors.

18:15  4.6.4  POWER-AWARE ONLINE TESTING OF MANYCORE SYSTEMS IN THE DARK SILICON ERA
Speakers:
Mohammad-Hashem Haghbayan1, Amir-Mohammad Rahmani1, Mohammad Fattah1, Pasi Liljeberg1, Juha Plosila1, Hannu Tenhunen2 and Zainalabedin Navabi3
1University of Turku, FI; 2KTH Royal Institute of Technology, SE; 3Worcester Polytechnic Institute, US
Abstract
Online defect screening techniques to detect run-time faults are becoming a necessity in current and near-future technologies. At the same time, due to aggressive technology scaling into the nanometer regime, power consumption is becoming a significant burden. Most of today's chips employ advanced power management features to monitor the power consumption and apply dynamic power budgeting (i.e., capping) accordingly to prevent over-heating of the chip. Given the notable power dissipation of existing testing methods, one needs to efficiently manage the power budget to cover the test process of a manycore system at runtime. In this paper, we propose a power-aware online testing method for manycore systems benefiting from advanced power management capabilities. The proposed power-aware method uses a non-intrusive online test scheduling strategy to functionally test the cores in their idle periods. In addition, we propose a test-aware utilization-oriented runtime mapping technique that considers the utilization of cores and their test criticality in the mapping process. Our extensive experimental results reveal that the proposed power-aware online testing approach can efficiently utilize temporarily free resources and the available power budget for testing purposes, with less than 1% penalty for the 16nm technology.
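
The idea of testing idle cores only when the power budget allows it can be sketched as a simple selection loop. This is not the scheduler from the paper: the core records, the fixed per-test power cost, and the rule of serving the most test-critical idle cores first are all assumptions introduced for illustration.

```python
def pick_cores_to_test(cores, power_cap, test_power):
    """Choose idle cores to test without exceeding the chip power cap.

    cores: list of dicts with hypothetical fields
           'id', 'idle' (bool), 'power' (current draw), 'criticality'.
    test_power: assumed extra power drawn by testing one core.
    """
    # Power still available under the cap given the current consumption.
    headroom = power_cap - sum(core["power"] for core in cores)
    # Consider idle cores only, most test-critical first.
    idle = sorted(
        (core for core in cores if core["idle"]),
        key=lambda core: core["criticality"],
        reverse=True,
    )
    chosen = []
    for core in idle:
        if test_power <= headroom:
            chosen.append(core["id"])
            headroom -= test_power
    return chosen


# Assumed example: one busy core, three idle cores, toy power values.
cores = [
    {"id": 0, "idle": False, "power": 3.0, "criticality": 0.4},
    {"id": 1, "idle": True, "power": 0.5, "criticality": 0.9},
    {"id": 2, "idle": True, "power": 0.5, "criticality": 0.2},
    {"id": 3, "idle": True, "power": 0.5, "criticality": 0.6},
]
print(pick_cores_to_test(cores, power_cap=8.0, test_power=1.5))
```

Busy cores are never selected in this sketch, which is one simple way to keep testing non-intrusive; a real system would also need to abort or pause a test when the core is claimed by the workload.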

18:30  IP2-2, 940  ON-LINE PREDICTION OF NBTI-INDUCED AGING RATES
Speakers:
Rafal Baranowski1, Farshad Firouzi2, Saman Kiamehr2, Chang Liu1, Hans-Joachim Wunderlich1 and Mehdi Tahoori2
1Stuttgart University, DE; 2Karlsruhe Institute of Technology (KIT), DE
Abstract
Nanoscale technologies are increasingly susceptible to aging processes such as Negative-Bias Temperature Instability (NBTI), which undermine the reliability of VLSI systems. Existing monitoring techniques can detect the violation of safety margins and hence make the prediction of an imminent failure possible. However, since such techniques can only detect measurable degradation effects which appear after a relatively long period of system operation, they are not well suited to early aging prediction and proactive aging alleviation. This work presents a novel method for monitoring the NBTI-induced degradation rate in digital circuits. It enables the timely adoption of proper mitigation techniques that reduce the impact of aging. The proposed method employs machine learning techniques to find a small set of so-called Representative Critical Gates (RCG), the workload of which is correlated with the degradation of the entire circuit. The workload of RCGs is observed in hardware using so-called workload monitors. The output of the workload monitors is evaluated on-line to predict system degradation experienced within a configurable (short) period of time, e.g. a fraction of a second. Experimental results show that the proposed monitors predict the degradation rate with an average error of only 3% at less than 2.4% area overhead.

18:30  End of session