7.3 Optimizing performance, energy and predictability via hardware/software codesign


Date: Wednesday 29 March 2017
Time: 14:30 - 16:00
Location / Room: 2BC

Chair:
Avesta Sasan, George Mason University, US

Co-Chair:
Stefano Di Carlo, Politecnico di Torino, IT

This session presents a variety of architectural solutions to improve performance, energy, and predictability, covering several hardware blocks: the processor pipeline, caches, memory, and on-chip I/O. The first paper proposes a hardware/software mechanism to classify memory accesses as private or shared. The second paper introduces a low-power asynchronous microprocessor design. The third paper proposes a coordinated approach to improving performance by partitioning multilevel caches. The last paper proposes a hardware approach to increasing the timing accuracy of I/O operations.

Time  Label  Presentation Title
Authors
14:30  7.3.1  ACCURATE PRIVATE/SHARED CLASSIFICATION OF MEMORY ACCESSES: A RUN-TIME ANALYSIS SYSTEM FOR THE LEON3 MULTI-CORE PROCESSOR
Speaker:
Nam Ho, Department of Computer Science, University of Paderborn, DE
Authors:
Nam Ho, Ishraq Ibne Ashraf, Paul Kaufmann and Marco Platzner, Department of Computer Science, University of Paderborn, Germany, DE
Abstract
Related work has presented simulation-based experiments to classify data accesses in a shared-memory multi-core into private and shared. This information can be used to selectively turn on/off cache coherency mechanisms for data blocks, which can save memory bus bandwidth, minimize energy consumption, and reduce application runtimes. In this paper we present an implementation of a private/shared classification mechanism on a LEON3 SPARC multi-core processor running the Linux 2.6 kernel. Our mechanism is page-based and allows for classifying and counting data accesses at run-time. Compared to previous work, our system provides more accurate, i.e., realistic, data as it includes a real multi-core architecture and an OS. Additionally, our prototype allows us to quantitatively evaluate the overhead of the classification mechanism. We test our system with sequential and parallel benchmarks from the MiBench, ParMiBench, PARSEC, and SPLASH-2 application suites. The results show that parallel benchmarks are promising targets for selectively controlling coherency mechanisms and that the run-time overheads induced by our mechanism are rather small.

Download Paper (PDF; Only available from the DATE venue WiFi)
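The page-based classification idea the abstract describes can be sketched in a few lines: a page is private while only one core has touched it and becomes shared once a second core accesses it. The Python model below is illustrative only (the class and field names are invented); the actual prototype hooks into the LEON3 hardware and the Linux kernel rather than tracking accesses in software.

```python
from collections import defaultdict

class PageClassifier:
    """Tracks, per memory page, which cores have accessed it.

    A page is 'private' while only one core has touched it and becomes
    'shared' once a second core accesses it. The counters model the
    run-time access statistics the abstract mentions (software sketch,
    not the hardware/kernel mechanism itself).
    """
    PAGE_SIZE = 4096

    def __init__(self):
        self.owner = {}                 # page -> first core that touched it
        self.shared = set()             # pages observed from more than one core
        self.counts = defaultdict(int)  # (page, kind) -> number of accesses

    def access(self, core, addr):
        page = addr // self.PAGE_SIZE
        if page not in self.owner:
            self.owner[page] = core     # first touch: page is private to this core
        elif self.owner[page] != core:
            self.shared.add(page)       # second core seen: page is now shared
        kind = 'shared' if page in self.shared else 'private'
        self.counts[(page, kind)] += 1
        return kind
```

For example, two accesses by core 0 to the same page are classified as private, and the page flips to shared as soon as core 1 touches it; coherency could then be enabled only for pages in the shared set.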
15:00  7.3.2  DESIGN OF A LOW POWER, RELATIVE TIMING BASED ASYNCHRONOUS MSP430 MICROPROCESSOR
Speaker:
Dipanjan Bhadra, University of Utah, US
Authors:
Dipanjan Bhadra and Kenneth Stevens, University of Utah, US
Abstract
Power dissipation is one of the primary design constraints in modern digital circuits. From hand-held portable devices to big-data analytics on high-performance computing systems, low energy dissipation is a key requirement for most modern devices. This paper showcases a low-power circuit design methodology based on Relative Timing driven asynchronous techniques. A low-power MSP430 microprocessor design based on a novel asynchronous finite state machine implementation is presented. The design showcases the power benefits of the proposed asynchronous implementation over its synchronous counterpart while avoiding major architectural modifications that would directly influence performance or power consumption. The implemented asynchronous MSP430 exhibits a minimum 8X power benefit over the synchronous design for an almost identical pipeline structure and comparable throughput. The paper further elaborates on the novel asynchronous state machine implementation used for the design and presents an efficient method to design communicating asynchronous finite state machines in clock-less systems.

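The request/acknowledge handshaking that replaces the clock in communicating asynchronous FSMs can be illustrated with a toy model. The function below simulates a 4-phase handshake between a sender and a receiver state machine; it is a behavioural sketch only and says nothing about the relative-timing constraints that the paper verifies at the circuit level.

```python
def four_phase_handshake(data_items):
    """Models a clock-less 4-phase request/acknowledge handshake between
    a sender FSM and a receiver FSM. Each transfer completes through
    signal transitions rather than clock edges (illustrative model only;
    real relative-timing designs prove signal orderings at circuit level).
    """
    req = ack = 0
    received = []
    for item in data_items:
        # Sender: place data on the bus, raise request.
        bus = item
        req = 1
        # Receiver: sees req high, latches data, raises acknowledge.
        if req and not ack:
            received.append(bus)
            ack = 1
        # Sender: sees ack high, lowers request.
        if ack:
            req = 0
        # Receiver: sees req low, lowers ack -> both FSMs ready for next item.
        if not req:
            ack = 0
    return received
```

Each iteration walks through the four phases of one transfer; in hardware the two machines run concurrently and the ordering is guaranteed by the handshake protocol itself, not by a shared clock.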
15:30  7.3.3  A COORDINATED MULTI-AGENT REINFORCEMENT LEARNING APPROACH TO MULTI-LEVEL CACHE CO-PARTITIONING
Speaker:
Preeti Ranjan Panda, Indian Institute of Technology Delhi, IN
Authors:
Rahul Jain1, Preeti Ranjan Panda2 and Sreenivas Subramoney3
1Indian Institute of Technology, Delhi, IN; 2IIT Delhi, IN; 3Microarchitecture Research Lab, Intel, IN
Abstract
The widening gap between processor and memory performance has led to the inclusion of multiple levels of caches in modern multi-core systems. Processors with simultaneous multithreading (SMT) support multiple hardware threads on the same physical core, which results in shared private caches. Any inefficiency in the cache hierarchy can negatively impact system performance, motivating a co-optimization of multiple cache levels that trades off individual application throughput for better system throughput and energy-delay product (EDP). We propose a novel coordinated multi-agent reinforcement learning technique for performing Dynamic Cache Co-partitioning, called DCC. DCC has low implementation overhead and does not require any special hardware data profilers. We have validated our proposal with 15 8-core workloads created using Spec2006 benchmarks and found it to be an effective co-partitioning technique. DCC exhibited system throughput and EDP improvements of up to 14% (gmean: 9.35%) and 19.2% (gmean: 13.5%), respectively. We believe this is the first attempt at addressing the problem of multi-level cache co-partitioning.

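The coordination idea, several learning agents (one per cache level) driven by a single system-level reward, can be sketched as follows. This is a generic epsilon-greedy Q-learning toy: the class names, the action encoding, and the throughput model in the usage example are all invented for illustration, and this is not the DCC algorithm itself.

```python
import random

class PartitionAgent:
    """One agent per cache level; its actions are candidate way-partitions.
    All agents receive the same system-level reward, which is what makes
    the learning 'coordinated' (simplified sketch of the idea)."""
    def __init__(self, actions, eps=0.1, alpha=0.5):
        self.q = {a: 0.0 for a in actions}   # estimated value of each partition
        self.eps, self.alpha = eps, alpha
        self.last = None

    def choose(self):
        if random.random() < self.eps:       # explore a random partition
            self.last = random.choice(list(self.q))
        else:                                # exploit the best-known partition
            self.last = max(self.q, key=self.q.get)
        return self.last

    def learn(self, reward):
        self.q[self.last] += self.alpha * (reward - self.q[self.last])

def co_partition(throughput_model, l2_opts, l3_opts, epochs=200):
    """Runs the two agents against a shared reward and returns the
    best partition found for each level."""
    random.seed(0)
    l2, l3 = PartitionAgent(l2_opts), PartitionAgent(l3_opts)
    for _ in range(epochs):
        reward = throughput_model(l2.choose(), l3.choose())
        l2.learn(reward)   # the shared reward couples the agents
        l3.learn(reward)
    return max(l2.q, key=l2.q.get), max(l3.q, key=l3.q.get)
```

With a toy throughput model that rewards one L2 choice more than the other, the L2 agent converges to it even though its reward samples are perturbed by the L3 agent's exploration.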
15:45  7.3.4  GPIOCP: TIMING-ACCURATE GENERAL PURPOSE I/O CONTROLLER FOR MANY-CORE REAL-TIME SYSTEMS
Speaker:
Zhe Jiang, University of York, GB
Authors:
Zhe Jiang and Neil Audsley, University of York, GB
Abstract
Modern SoC/NoC chips often provide General-Purpose I/O (GPIO) pins for connecting devices that are not directly integrated within the chip. Timing-accurate control of devices connected to GPIO is often required within embedded real-time systems, i.e., I/O operations should occur at exact times, with minimal error, being neither significantly early nor late. This is difficult to achieve due to the latencies and contentions present in the architecture between the CPU instigating the I/O operation and the device connected to the GPIO: software drivers, the RTOS, buses, and bus contention all introduce significant variable latencies before a command reaches the device. This is compounded in NoC devices utilising a mesh interconnect between CPUs and I/O devices. The contribution of this paper is a resource-efficient programmable I/O controller, termed the GPIO Command Processor (GPIOCP), that permits applications to instigate complex sequences of I/O operations at an exact time, achieving timing accuracy at the single-clock-cycle level. I/O operations can also be programmed to occur at some point in the future, periodically, or reactively. The GPIOCP is a parallel I/O controller, supporting cycle-level timing accuracy across several devices connected to GPIO simultaneously. The GPIOCP exploits the tradeoff between using a full sequential CPU to control each GPIO-connected device, which achieves timing accuracy at high resource cost, and the poor timing accuracy achieved when the application CPU controls the device remotely. The GPIOCP has a low hardware cost compared to CPU-based approaches, with the additional benefits of total timing accuracy (which CPU solutions do not provide in general) and parallel control of many I/O devices.

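The core idea, a small controller that executes pre-programmed I/O commands at exact cycle counts, can be modelled behaviourally. The command format, pin model, and periodic re-arming below are illustrative assumptions, not the GPIOCP's actual interface.

```python
import heapq

class GPIOCommandProcessor:
    """Executes pre-programmed GPIO writes at exact cycle counts,
    decoupling I/O timing from variable CPU/NoC latency (behavioural
    sketch; the command format and pin model are invented)."""
    def __init__(self):
        self.queue = []   # min-heap of (cycle, seq, pin, value, period)
        self.pins = {}    # current pin values
        self.trace = []   # (cycle, pin, value) writes actually issued
        self._seq = 0     # tie-breaker so equal-cycle commands stay ordered

    def program(self, cycle, pin, value, period=None):
        """Schedule a write; a non-None period makes it repeat."""
        heapq.heappush(self.queue, (cycle, self._seq, pin, value, period))
        self._seq += 1

    def run(self, cycles):
        """Advance the controller one cycle at a time; commands fire
        exactly when the cycle counter matches their timestamp."""
        for now in range(cycles):
            while self.queue and self.queue[0][0] == now:
                _, _, pin, value, period = heapq.heappop(self.queue)
                self.pins[pin] = value
                self.trace.append((now, pin, value))
                if period:   # re-arm periodic commands
                    self.program(now + period, pin, value, period)
```

Because commands are released by the controller's own cycle counter, the issue time is independent of how long the command took to travel from the CPU through the interconnect, which is the property the paper is after.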
16:00  IP3-10, 125  A HARDWARE IMPLEMENTATION OF THE MCAS SYNCHRONIZATION PRIMITIVE
Speaker:
Smruti Sarangi, IIT Delhi, IN
Authors:
Srishty Patel, Rajshekar Kalayappan, Ishani Mahajan and Smruti R. Sarangi, IIT Delhi, IN
Abstract
Lock-based parallel programs are easy to write. However, they are inherently slow as the synchronization is blocking in nature. Non-blocking lock-free programs, which use atomic instructions such as compare-and-set (CAS), are significantly faster. However, lock-free programs are notoriously difficult to design and debug. This can be greatly eased if the primitives work on multiple memory locations instead of one. We propose MCAS, a hardware implementation of a multi-word compare-and-set primitive. Ease of programming aside, MCAS-based programs are 13.8X and 4X faster on average than lock-based and traditional lock-free programs, respectively. The area overhead, in a 32-core 400mm2 chip, is a mere 0.046%.

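The semantics of a multi-word compare-and-set are easy to state precisely: every word must still hold its expected value, or nothing is written. The sketch below models those semantics in software, with a lock standing in for the hardware atomicity the paper provides; it is not the proposed design.

```python
import threading

class MCASMemory:
    """Multi-word compare-and-set: atomically checks several words and,
    only if all match their expected values, writes all new values.
    A lock models the hardware atomicity described in the abstract
    (a software model of the primitive's semantics, not the design)."""
    def __init__(self, size):
        self.words = [0] * size
        self._lock = threading.Lock()

    def mcas(self, updates):
        """updates: list of (address, expected, new) triples.
        Returns True iff every word matched and all writes were applied."""
        with self._lock:
            if all(self.words[a] == exp for a, exp, _ in updates):
                for a, _, new in updates:
                    self.words[a] = new
                return True
            return False   # any mismatch leaves memory untouched
```

A single-word CAS loop cannot, for example, move a value between two lock-free queue slots in one atomic step; with MCAS both slots are checked and updated together or not at all.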
16:01  IP3-11, 325  BANDITS: DYNAMIC TIMING SPECULATION USING MULTI-ARMED BANDIT BASED OPTIMIZATION
Speaker:
Jeff Zhang, New York University, US
Authors:
Jeff Zhang and Siddharth Garg, New York University, US
Abstract
Timing speculation has recently been proposed as a method for increasing performance beyond what is achievable by conventional worst-case design techniques. Starting with the observation of fast temporal variations in timing error probabilities, we propose a run-time technique to dynamically determine the optimal degree of timing speculation (i.e., how aggressively the processor is over-clocked) based on a novel formulation of the dynamic timing speculation problem as a multi-armed bandit problem. In detailed post-synthesis timing simulations of a 5-stage MIPS processor running a variety of workloads, the proposed adaptive mechanism improves the processor's performance significantly compared with a competing approach (about 8.3% improvement), while showing only about a 2.8% performance loss on average compared with oracle results.

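The bandit formulation can be illustrated with a generic epsilon-greedy learner: each arm is a candidate over-clocking level, and the reward trades raw speed against the cost of recovering from timing errors. The reward model in the usage example is invented for illustration and is not the paper's formulation.

```python
import random

def tune_overclock(measure_reward, freq_levels, steps=500, eps=0.1):
    """Epsilon-greedy multi-armed bandit over candidate clock settings.
    measure_reward(f) returns the observed payoff of running at level f
    (generic bandit sketch, not the paper's exact algorithm)."""
    random.seed(1)
    q = {f: 0.0 for f in freq_levels}   # estimated reward per arm
    n = {f: 0 for f in freq_levels}     # pull count per arm
    for _ in range(steps):
        if random.random() < eps:       # explore a random setting
            f = random.choice(freq_levels)
        else:                           # exploit the best-known setting
            f = max(q, key=q.get)
        r = measure_reward(f)
        n[f] += 1
        q[f] += (r - q[f]) / n[f]       # incremental mean of observed reward
    return max(q, key=q.get)
```

With a toy reward where a moderate over-clock gives the best speed/error tradeoff (here 1.1x: slight errors, cheap recovery; 1.2x: error recovery wipes out the gain), the learner settles on the moderate level.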
16:02  IP3-12, 261  DESIGN AND IMPLEMENTATION OF A FAIR CREDIT-BASED BANDWIDTH SHARING SCHEME FOR BUSES
Speaker:
Carles Hernandez, Barcelona Supercomputing Center (BSC), ES
Authors:
Mladen Slijepcevic1, Carles Hernandez2, Jaume Abella3 and Francisco Cazorla4
1Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, ES; 2Barcelona Supercomputing Center, ES; 3Barcelona Supercomputing Center (BSC-CNS), ES; 4Barcelona Supercomputing Center and IIIA-CSIC, ES
Abstract
Fair arbitration in the access to hardware shared resources is fundamental to obtain low worst-case execution time (WCET) estimates in the context of critical real-time systems, for which performance guarantees are essential. Several hardware mechanisms exist for managing arbitration in those resources (buses, memory controllers, etc.). They typically attain fairness in terms of the number of slots each contender (e.g., core) gets granted access to the shared resource. However, those policies may lead to unfair bandwidth allocations for workloads with contenders issuing short requests and contenders issuing long requests. We propose a Credit-Based Arbitration (CBA) mechanism that achieves fairness in the cycles each core is granted access to the resource rather than in the number of granted slots. Furthermore, we implement CBA as part of a LEON3 4-core processor for the Space domain in an FPGA proving the feasibility and good performance characteristics of the design by comparing it against other arbitration schemes.

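The difference between slot-based and cycle-based fairness is easy to demonstrate: charge each contender the actual cycles its request occupies the bus, not one "slot" per grant. The model below is an illustrative simplification (the replenishment policy is an assumption), not the CBA hardware design.

```python
def credit_based_arbiter(requests, credit_per_round=8):
    """Grants the bus to the contender holding the most remaining cycle
    credits and charges the actual cycles each request occupies the bus,
    so cores issuing long requests cannot starve cores issuing short ones.

    requests: list of (core, length_in_cycles) in arrival order.
    Returns (grant order, cycles consumed per core)."""
    cores = {c for c, _ in requests}
    credits = {c: credit_per_round for c in cores}
    used = {c: 0 for c in cores}
    pending = list(requests)
    order = []
    while pending:
        # Grant the pending request whose core holds the most credits.
        i = max(range(len(pending)), key=lambda k: credits[pending[k][0]])
        core, length = pending.pop(i)
        order.append(core)
        credits[core] -= length   # charge actual bus occupancy, not one slot
        used[core] += length
        if all(credits[c] <= 0 for c in cores):   # replenish when exhausted
            for c in cores:
                credits[c] += credit_per_round
    return order, used
```

In the usage below, core 0 issues 6-cycle requests and core 1 issues 1-cycle requests; after core 0's first long transfer its credits drop, so all of core 1's short requests overtake core 0's second long one, which is the cycle-level fairness a slot-based round-robin would miss.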
16:00  End of session
Coffee Break in Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Tuesday, March 28, 2017

  • Coffee Break 10:30 - 11:30
  • Coffee Break 16:00 - 17:00

Wednesday, March 29, 2017

  • Coffee Break 10:00 - 11:00
  • Coffee Break 16:00 - 17:00

Thursday, March 30, 2017

  • Coffee Break 10:00 - 11:00
  • Coffee Break 15:30 - 16:00