4.4 Exploring Reliability and Efficiency Tradeoffs at the Architectural Level


Date: Tuesday 10 March 2015
Time: 17:00 - 18:30
Location / Room: Chartreuse

Chair:
Todd Austin, University of Michigan, US

Co-Chair:
Gunar Schirner, Northeastern University, US

This session targets architectural solutions for energy-efficient and reliable memories and processors.

Time    Label    Presentation Title / Authors
17:00    4.4.1    SOFT-ERROR RELIABILITY AND POWER CO-OPTIMIZATION FOR GPGPUS REGISTER FILE USING RESISTIVE MEMORY
Speakers:
Jingweijia Tan1, Zhi Li2 and Xin Fu1
1University of Houston, US; 2University of Kansas, US
Abstract
The increasing adoption of graphics processing units (GPUs) for high-performance computing raises reliability challenges that are generally ignored in traditional GPUs. GPUs usually support thousands of parallel threads and therefore require a sizable register file. Such a large register file is both highly susceptible to soft errors and power-hungry. Although ECC has been applied to the register file in modern GPUs, it causes considerable power overhead, which further increases the power stress. Thus, an energy-efficient soft-error protection mechanism is desirable. Besides its extremely low leakage power consumption, resistive memory (e.g. spin-transfer torque RAM) is also immune to radiation-induced soft errors due to its magnetic-field-based storage. In this paper, we propose to LEverage reSistive memory to enhance the Soft-error robustness and reduce the power consumption (LESS) of registers in general-purpose computing on GPUs (GPGPUs). Since resistive memory has a longer write latency than SRAM, we exploit the unique characteristics of GPGPU applications to obtain win-win gains: near-full soft-error protection for the register file and, at the same time, substantially reduced energy consumption with negligible performance loss. Our experimental results show that LESS reduces the register file's soft-error vulnerability by 86% and achieves 60% energy savings with negligible (e.g. 4%) performance loss.
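
As a rough illustration of the latency-hiding idea behind LESS, the sketch below models a GPU-style scheduler that keeps issuing instructions from other ready warps while a slow STT-RAM register write completes. The latencies, the three-instruction warps and the scheduler itself are invented for illustration and are not the authors' design or simulator.

```python
# Toy model of how abundant warp-level parallelism can hide the longer write
# latency of an STT-RAM register file: while one warp waits for a slow register
# write to complete, other ready warps are issued, so the pipeline rarely stalls.
# All latencies and the workload below are assumptions, not the LESS design.
from collections import deque

SRAM_WRITE_CYCLES = 1      # assumed SRAM register write latency
STTRAM_WRITE_CYCLES = 5    # assumed (longer) STT-RAM register write latency

class Warp:
    def __init__(self, wid, instructions):
        self.wid = wid
        self.instructions = deque(instructions)   # (dst_reg, [src_regs]) pairs

def run(warps, write_latency):
    """Issue one instruction per cycle from the first warp whose sources are ready."""
    ready_at = {}          # (warp id, register) -> cycle its pending write completes
    cycle = stalls = 0
    while any(w.instructions for w in warps):
        issued = False
        for w in warps:    # greedy scheduler: pick the first ready warp
            if not w.instructions:
                continue
            dst, srcs = w.instructions[0]
            if all(ready_at.get((w.wid, r), 0) <= cycle for r in srcs):
                w.instructions.popleft()
                ready_at[(w.wid, dst)] = cycle + write_latency
                issued = True
                break
        if not issued:
            stalls += 1    # no warp could issue: the write latency is exposed
        cycle += 1
    return cycle, stalls

def make_warps(n):
    # Each warp runs a short dependent chain: r1 <- (), r2 <- (r1), r3 <- (r2)
    return [Warp(i, [("r1", []), ("r2", ["r1"]), ("r3", ["r2"])]) for i in range(n)]

if __name__ == "__main__":
    for n_warps in (1, 8):
        for name, lat in (("SRAM", SRAM_WRITE_CYCLES), ("STT-RAM", STTRAM_WRITE_CYCLES)):
            cycles, stalls = run(make_warps(n_warps), lat)
            print(f"{n_warps} warp(s), {name:7s}: {cycles:3d} cycles, {stalls:2d} stall cycles")
```

With a single warp the STT-RAM write latency shows up directly as stall cycles, whereas with eight warps the scheduler finds other work every cycle, which is the kind of behaviour the paper exploits.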

17:30    4.4.2    ENERGY-EFFICIENT CACHE DESIGN IN EMERGING MOBILE PLATFORMS: THE IMPLICATIONS AND OPTIMIZATIONS
Speakers:
Kaige Yan and Xin Fu, University of Houston, US
Abstract
Mobile devices are quickly becoming the most widely used computing platforms among consumers. Since their major power supply is the battery, energy-efficient computing is highly desired. In this paper, we focus on energy-efficient cache design in emerging mobile platforms. We observe that more than 40% of L2 cache accesses are OS kernel accesses in interactive smartphone applications. Such frequent kernel accesses cause serious interference between user and kernel blocks in the L2 cache, leading to unnecessary block replacements and a high L2 cache miss rate. We propose to partition the L2 cache into two separate segments that can only be accessed by user code and kernel code, respectively. Meanwhile, the overall size of the two segments is shrunk, which reduces energy consumption by 15% while still maintaining a similar cache miss rate. We further find completely different access behaviors in the separated kernel and user segments of our novel L2 cache design, and explore multi-retention STT-RAM based user and kernel segments to maximize the cache energy savings. The experimental results show that our techniques significantly reduce cache energy consumption (e.g. by 75%) with only a 2% performance loss in emerging smartphones.
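
The observation that kernel and user blocks displace each other in a shared L2 can be illustrated with a small cache model. The sketch below counts cross-domain evictions in a shared LRU cache on a synthetic trace with roughly 40% kernel accesses; the cache organization and the trace are invented, so it shows only the interference mechanism, not the paper's miss-rate or energy results.

```python
# Minimal sketch of the interference the paper observes in a shared L2: OS kernel
# accesses evict user blocks (and vice versa), causing replacements that a
# user/kernel-partitioned cache avoids by construction.
from collections import OrderedDict
import random

class SharedLRUCache:
    """Fully associative LRU cache that remembers which domain (user/kernel)
    brought each block in, so cross-domain evictions can be counted."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()          # block address -> inserting domain
        self.misses = 0
        self.cross_evictions = 0             # e.g. a kernel access evicts a user block

    def access(self, domain, addr):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)    # LRU update on a hit
            return
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            _, victim_domain = self.blocks.popitem(last=False)   # evict LRU block
            if victim_domain != domain:
                self.cross_evictions += 1
        self.blocks[addr] = domain

def synthetic_trace(n=50000, kernel_fraction=0.4, seed=0):
    # ~40% kernel accesses, as reported for interactive smartphone apps; the
    # address distributions themselves are made up for the demonstration.
    rng = random.Random(seed)
    for _ in range(n):
        if rng.random() < kernel_fraction:
            yield "kernel", ("k", rng.randrange(600))
        else:
            yield "user", ("u", rng.randrange(900))

if __name__ == "__main__":
    shared = SharedLRUCache(capacity=1024)
    for domain, addr in synthetic_trace():
        shared.access(domain, addr)
    print(f"shared L2: {shared.misses} misses, "
          f"{shared.cross_evictions} cross-domain evictions")
    # A partitioned design gives user and kernel code their own smaller segments
    # (e.g. SharedLRUCache(512) and SharedLRUCache(256)), so cross-domain
    # evictions cannot occur; the paper additionally builds the segments from
    # multi-retention STT-RAM to cut energy.
```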

18:00    4.4.3    EXPLOITING DYNAMIC TIMING MARGINS IN MICROPROCESSORS FOR FREQUENCY-OVER-SCALING WITH INSTRUCTION-BASED CLOCK ADJUSTMENT
Speakers:
Jeremy Constantin1, Lai Wang2, Georgios Karakonstantis3, Anupam Chattopadhyay2 and Andreas Burg1
1École Polytechnique Fédérale de Lausanne (EPFL), CH; 2RWTH Aachen, DE; 3Queen's University, GB
Abstract
Static timing analysis provides the basis for setting the clock period of a microprocessor core, based on its worst-case critical path. However, depending on the design, this critical path is not always excited and therefore dynamic timing margins exist that can theoretically be exploited for the benefit of better speed or lower power consumption (through voltage scaling). This paper introduces predictive instruction-based dynamic clock adjustment as a technique to trim dynamic timing margins in pipelined microprocessors. To this end, we exploit the different timing requirements for individual instructions during the dynamically varying program execution flow without the need for complex circuit-level measures to detect and correct timing violations. We provide a design flow to extract the dynamic timing information for the design using post-layout dynamic timing analysis and we integrate the results into a custom cycle-accurate simulator. This simulator allows annotation of individual instructions with their impact on timing (in each pipeline stage) and rapidly derives the overall code execution time for complex benchmarks. The design methodology is illustrated at the microarchitecture level, demonstrating the performance and power gains possible on a 6-stage OpenRISC in-order general purpose processor core in a 28 nm CMOS technology. We show that employing instruction-dependent dynamic clock adjustment leads on average to an increase in operating speed by 38% or to a reduction in power consumption by 24%, compared to traditional synchronous clocking, which at all times has to respect the worst-case timing identified through static timing analysis.
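
A back-of-the-envelope sketch of the clocking idea follows: each cycle the clock period is set by the slowest instruction currently in flight rather than by the static worst case. The instruction classes, their delays, and the simplification to one delay per instruction (the paper extracts per-stage timing from post-layout dynamic timing analysis) are all assumptions for illustration.

```python
# Sketch of instruction-based dynamic clock adjustment: each cycle the clock
# period is stretched only as far as the slowest instruction currently in the
# pipeline requires, instead of always using the static worst-case period.
from collections import deque
import random

# Assumed per-instruction-class timing requirements in nanoseconds.
CLASS_DELAY_NS = {"alu": 0.80, "branch": 0.85, "store": 0.95, "load": 1.00, "mul": 1.25}
WORST_CASE_NS = max(CLASS_DELAY_NS.values())   # period chosen by static timing analysis
PIPELINE_DEPTH = 6                             # e.g. a 6-stage in-order core

def execution_time_ns(trace, dynamic):
    """Time to run `trace` on an idealized stall-free pipeline."""
    pipeline = deque(maxlen=PIPELINE_DEPTH)
    total_ns = 0.0
    # Append None bubbles at the end so the last instructions drain the pipeline.
    for instr in list(trace) + [None] * (PIPELINE_DEPTH - 1):
        pipeline.append(instr)
        in_flight = [i for i in pipeline if i is not None]
        if dynamic:
            # This cycle's period: slowest requirement among in-flight instructions.
            total_ns += max(CLASS_DELAY_NS[i] for i in in_flight)
        else:
            total_ns += WORST_CASE_NS
    return total_ns

if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic instruction mix; the achievable gain depends entirely on it.
    trace = rng.choices(list(CLASS_DELAY_NS), weights=[50, 15, 10, 20, 5], k=100_000)
    t_static = execution_time_ns(trace, dynamic=False)
    t_dynamic = execution_time_ns(trace, dynamic=True)
    print(f"worst-case clocking:      {t_static / 1000:.1f} us")
    print(f"dynamic clock adjustment: {t_dynamic / 1000:.1f} us "
          f"({t_static / t_dynamic:.2f}x faster on this toy mix)")
```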

18:15    4.4.4    VARIABILITY-AWARE DARK SILICON MANAGEMENT IN ON-CHIP MANY-CORE SYSTEMS
Speakers:
Muhammad Shafique1, Dennis Gnad1, Siddharth Garg2 and Joerg Henkel1
1Karlsruhe Institute of Technology (KIT), DE; 2University of Waterloo, CA
Abstract
Dark Silicon refers to the constraint that only a fraction of on-chip resources (cores) can be simultaneously powered on (running at full performance) in order to stay within the allowable power budget and safe temperature limits, while the others remain 'dark'. In this paper, we demonstrate how these 'dark cores' can be leveraged to improve the temperature profile at run-time, thus providing opportunities to power on more cores at the nominal voltage than the number allowed when strictly obeying the conventional Thermal Design Power (TDP) constraint. We propose a computationally efficient dark silicon management technique that determines the best set of cores to keep dark and the mapping of threads to cores at run-time, while also accounting for the impact of process variations. We have developed a light-weight temperature prediction mechanism that determines the impact of different candidate solutions on the chip's thermal profile. Experimental evaluation of the proposed technique on a simulated 8×8 many-core processor, across a range of chips to account for process variations, shows that total instruction throughput is increased by 1.8× on average while keeping the temperature within safe limits, compared with state-of-the-art approaches.
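
The selection problem can be sketched as follows: greedily power on the fastest cores, as ranked by a process-variation map, while a lightweight temperature estimate for every core stays below the safe limit. The thermal superposition model, the variation map and all thresholds below are invented, and the paper's thread-to-core mapping step is omitted; this is only the flavour of such a technique, not the authors' algorithm.

```python
# Greedy toy sketch of variability-aware dark-silicon management: cores are
# activated one by one, in order of their variation-dependent speed, as long as
# a lightweight temperature estimate for every core stays below the safe limit.
import random

GRID = 8                 # 8x8 many-core floorplan
T_AMBIENT = 45.0         # deg C, assumed baseline die temperature
T_LIMIT = 80.0           # deg C, assumed safe temperature limit
SELF_HEAT = 20.0         # deg C an active core adds to itself (assumed)
NEIGHBOR_HEAT = 6.0      # deg C it adds to each of its 4-neighbours (assumed)

def neighbours(x, y):
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= x + dx < GRID and 0 <= y + dy < GRID:
            yield x + dx, y + dy

def predicted_temps(active):
    """Lightweight temperature estimate: superposition of per-core heat terms."""
    temps = {(x, y): T_AMBIENT for x in range(GRID) for y in range(GRID)}
    for (x, y) in active:
        temps[(x, y)] += SELF_HEAT
        for n in neighbours(x, y):
            temps[n] += NEIGHBOR_HEAT
    return temps

def choose_active_cores(core_speed):
    """Greedily power on the fastest cores while the thermal estimate allows it."""
    active = set()
    for core in sorted(core_speed, key=core_speed.get, reverse=True):
        candidate = active | {core}
        if max(predicted_temps(candidate).values()) <= T_LIMIT:
            active = candidate    # accept: no core exceeds the safe limit
    return active

if __name__ == "__main__":
    rng = random.Random(7)
    # Per-core maximum frequency in GHz, scattered by process variation (invented).
    core_speed = {(x, y): rng.gauss(2.0, 0.2) for x in range(GRID) for y in range(GRID)}
    active = choose_active_cores(core_speed)
    print(f"powered-on cores: {len(active)}, dark cores: {GRID * GRID - len(active)}")
    print(f"aggregate speed of active cores: {sum(core_speed[c] for c in active):.1f} GHz")
```

The dark cores end up interleaved with the active ones, acting as thermal buffers, which is the effect the paper exploits to activate more cores than a strict TDP budget would allow.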

18:30    End of session