7.3 Optimizing performance, energy and predictability via hardware/software codesign


Date: Wednesday 29 March 2017
Time: 14:30 - 16:00
Location / Room: 2BC

Chair:
Avesta Sasan, George Mason University, US

Co-Chair:
Stefano Di Carlo, Politecnico di Torino, IT

This session presents a variety of architectural solutions to improve performance, energy, and predictability, covering several hardware blocks: the processor pipeline, caches, memory, and on-chip I/O. The first paper proposes a hardware/software mechanism to classify memory accesses as private or shared. The second paper introduces a low-power asynchronous microprocessor design. The third paper proposes a coordinated approach to improving performance by partitioning multilevel caches. The last paper proposes a hardware approach to increasing the timing accuracy of I/O operations.

Time  Label  Presentation Title
Authors
14:30  7.3.1  ACCURATE PRIVATE/SHARED CLASSIFICATION OF MEMORY ACCESSES: A RUN-TIME ANALYSIS SYSTEM FOR THE LEON3 MULTI-CORE PROCESSOR
Speaker:
Nam Ho, Department of Computer Science, University of Paderborn, DE
Authors:
Nam Ho, Ishraq Ibne Ashraf, Paul Kaufmann and Marco Platzner, Department of Computer Science, University of Paderborn, Germany, DE
Abstract
Related work has presented simulation-based experiments to classify data accesses in a shared-memory multi-core into private and shared. This information can be used to selectively turn on/off cache coherency mechanisms for data blocks, which can save memory bus bandwidth, minimize energy consumption, and reduce application runtimes. In this paper we present an implementation of a private/shared classification mechanism on a LEON3 SPARC multi-core processor running the Linux 2.6 kernel. Our mechanism is page-based and allows for classifying and counting data accesses at run-time. Compared to previous work, our system provides more accurate, i.e., realistic, data as it includes a real multi-core architecture and an OS. Additionally, our prototype allows us to quantitatively evaluate the overhead of the classification mechanism. We test our system with sequential and parallel benchmarks from the MiBench, ParMiBench, PARSEC, and SPLASH-2 application suites. The results show that parallel benchmarks are promising targets for selectively controlling coherency mechanisms and that the run-time overheads induced by our mechanism are rather small.

Download Paper (PDF; Only available from the DATE venue WiFi)
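The page-based classification idea the abstract describes can be sketched in a few lines: a page is private while only one core has touched it and becomes shared once a second core accesses it. The Python model below is illustrative only (the class and field names are invented); the actual prototype hooks into the LEON3 hardware and the Linux kernel rather than tracking accesses in software.

```python
from collections import defaultdict

class PageClassifier:
    """Tracks, per memory page, which cores have accessed it.

    A page is 'private' while only one core has touched it and becomes
    'shared' once a second core accesses it. The counters model the
    run-time access statistics the abstract mentions (software sketch,
    not the hardware/kernel mechanism itself).
    """
    PAGE_SIZE = 4096

    def __init__(self):
        self.owner = {}                 # page -> first core that touched it
        self.shared = set()             # pages observed from more than one core
        self.counts = defaultdict(int)  # (page, kind) -> number of accesses

    def access(self, core, addr):
        page = addr // self.PAGE_SIZE
        if page not in self.owner:
            self.owner[page] = core     # first touch: page is private to this core
        elif self.owner[page] != core:
            self.shared.add(page)       # second core seen: page is now shared
        kind = 'shared' if page in self.shared else 'private'
        self.counts[(page, kind)] += 1
        return kind
```

For example, two accesses by core 0 to the same page are classified as private, and the page flips to shared as soon as core 1 touches it; coherency could then be enabled only for pages in the shared set.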
15:00  7.3.2  DESIGN OF A LOW POWER, RELATIVE TIMING BASED ASYNCHRONOUS MSP430 MICROPROCESSOR
Speaker:
Dipanjan Bhadra, University of Utah, US
Authors:
Dipanjan Bhadra and Kenneth Stevens, University of Utah, US
Abstract
Power dissipation is one of the primary design constraints in modern digital circuits. From hand-held portable devices to big-data analytics on high-performance computing systems, low energy dissipation is a key requirement for most modern devices. This paper showcases a low-power circuit design methodology based on Relative Timing driven asynchronous techniques. A low-power MSP430 microprocessor design based on a novel asynchronous finite state machine implementation is presented. The design showcases the power benefits of the proposed asynchronous implementation over its synchronous counterpart while avoiding major architectural modifications that would directly influence performance or power consumption. The implemented asynchronous MSP430 exhibits a minimum 8X power benefit over the synchronous design for an almost identical pipeline structure and comparable throughput. The paper further elaborates on the novel asynchronous state machine implementation used for the design and presents an efficient method to design communicating asynchronous finite state machines in clock-less systems.

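The request/acknowledge handshaking that replaces the clock in communicating asynchronous FSMs can be illustrated with a toy model. The function below simulates a 4-phase handshake between a sender and a receiver state machine; it is a behavioural sketch only and says nothing about the relative-timing constraints that the paper verifies at the circuit level.

```python
def four_phase_handshake(data_items):
    """Models a clock-less 4-phase request/acknowledge handshake between
    a sender FSM and a receiver FSM. Each transfer completes through
    signal transitions rather than clock edges (illustrative model only;
    real relative-timing designs prove signal orderings at circuit level).
    """
    req = ack = 0
    received = []
    for item in data_items:
        # Sender: place data on the bus, raise request.
        bus = item
        req = 1
        # Receiver: sees req high, latches data, raises acknowledge.
        if req and not ack:
            received.append(bus)
            ack = 1
        # Sender: sees ack high, lowers request.
        if ack:
            req = 0
        # Receiver: sees req low, lowers ack -> both FSMs ready for next item.
        if not req:
            ack = 0
    return received
```

Each iteration walks through the four phases of one transfer; in hardware the two machines run concurrently and the ordering is guaranteed by the handshake protocol itself, not by a shared clock.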
15:30  7.3.3  A COORDINATED MULTI-AGENT REINFORCEMENT LEARNING APPROACH TO MULTI-LEVEL CACHE CO-PARTITIONING
Speaker:
Preeti Ranjan Panda, Indian Institute of Technology Delhi, IN
Authors:
Rahul Jain1, Preeti Ranjan Panda2 and Sreenivas Subramoney3
1Indian Institute of Technology, Delhi, IN; 2IIT Delhi, IN; 3Microarchitecture Research Lab, Intel, IN
Abstract
The widening gap between processor and memory performance has led to the inclusion of multiple levels of caches in modern multi-core systems. Processors with simultaneous multithreading (SMT) support multiple hardware threads on the same physical core, which results in shared private caches. Any inefficiency in the cache hierarchy can negatively impact system performance, motivating a co-optimization of multiple cache levels that trades off individual application throughput for better system throughput and energy-delay product (EDP). We propose a novel coordinated multi-agent reinforcement learning technique for performing Dynamic Cache Co-partitioning, called DCC. DCC has low implementation overhead and does not require any special hardware data profilers. We have validated our proposal with 15 8-core workloads created using Spec2006 benchmarks and found it to be an effective co-partitioning technique. DCC exhibited system throughput and EDP improvements of up to 14% (gmean: 9.35%) and 19.2% (gmean: 13.5%), respectively. We believe this is the first attempt at addressing the problem of multi-level cache co-partitioning.

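The coordination idea, several learning agents (one per cache level) driven by a single system-level reward, can be sketched as follows. This is a generic epsilon-greedy Q-learning toy: the class names, the action encoding, and the throughput model in the usage example are all invented for illustration, and this is not the DCC algorithm itself.

```python
import random

class PartitionAgent:
    """One agent per cache level; its actions are candidate way-partitions.
    All agents receive the same system-level reward, which is what makes
    the learning 'coordinated' (simplified sketch of the idea)."""
    def __init__(self, actions, eps=0.1, alpha=0.5):
        self.q = {a: 0.0 for a in actions}   # estimated value of each partition
        self.eps, self.alpha = eps, alpha
        self.last = None

    def choose(self):
        if random.random() < self.eps:       # explore a random partition
            self.last = random.choice(list(self.q))
        else:                                # exploit the best-known partition
            self.last = max(self.q, key=self.q.get)
        return self.last

    def learn(self, reward):
        self.q[self.last] += self.alpha * (reward - self.q[self.last])

def co_partition(throughput_model, l2_opts, l3_opts, epochs=200):
    """Runs the two agents against a shared reward and returns the
    best partition found for each level."""
    random.seed(0)
    l2, l3 = PartitionAgent(l2_opts), PartitionAgent(l3_opts)
    for _ in range(epochs):
        reward = throughput_model(l2.choose(), l3.choose())
        l2.learn(reward)   # the shared reward couples the agents
        l3.learn(reward)
    return max(l2.q, key=l2.q.get), max(l3.q, key=l3.q.get)
```

With a toy throughput model that rewards one L2 choice more than the other, the L2 agent converges to it even though its reward samples are perturbed by the L3 agent's exploration.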
15:45  7.3.4  GPIOCP: TIMING-ACCURATE GENERAL PURPOSE I/O CONTROLLER FOR MANY-CORE REAL-TIME SYSTEMS
Speaker:
Zhe Jiang, University of York, GB
Authors:
Zhe Jiang and Neil Audsley, University of York, GB
Abstract
Modern SoC/NoC chips often provide General-Purpose I/O (GPIO) pins for connecting devices that are not directly integrated within the chip. Timing-accurate control of devices connected to GPIO is often required within embedded real-time systems, i.e., I/O operations should occur at exact times, with minimal error, being neither significantly early nor late. This is difficult to achieve due to the latencies and contentions present in the architecture between the CPU instigating the I/O operation and the device connected to the GPIO: software drivers, the RTOS, buses, and bus contention all introduce significant variable latencies before a command reaches the device. This is compounded in NoC devices utilising a mesh interconnect between CPUs and I/O devices. The contribution of this paper is a resource-efficient programmable I/O controller, termed the GPIO Command Processor (GPIOCP), that permits applications to instigate complex sequences of I/O operations at an exact time, achieving timing accuracy at the single-clock-cycle level. I/O operations can also be programmed to occur at some point in the future, periodically, or reactively. The GPIOCP is a parallel I/O controller, supporting cycle-level timing accuracy across several devices connected to GPIO simultaneously. The GPIOCP exploits the tradeoff between using a full sequential CPU to control each GPIO-connected device, which achieves timing accuracy at high resource cost, and the poor timing accuracy achieved when the application CPU controls the device remotely. The GPIOCP has a low hardware cost compared to CPU-based approaches, with the additional benefits of total timing accuracy (which CPU solutions do not provide in general) and parallel control of many I/O devices.

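The core idea, a small controller that executes pre-programmed I/O commands at exact cycle counts, can be modelled behaviourally. The command format, pin model, and periodic re-arming below are illustrative assumptions, not the GPIOCP's actual interface.

```python
import heapq

class GPIOCommandProcessor:
    """Executes pre-programmed GPIO writes at exact cycle counts,
    decoupling I/O timing from variable CPU/NoC latency (behavioural
    sketch; the command format and pin model are invented)."""
    def __init__(self):
        self.queue = []   # min-heap of (cycle, seq, pin, value, period)
        self.pins = {}    # current pin values
        self.trace = []   # (cycle, pin, value) writes actually issued
        self._seq = 0     # tie-breaker so equal-cycle commands stay ordered

    def program(self, cycle, pin, value, period=None):
        """Schedule a write; a non-None period makes it repeat."""
        heapq.heappush(self.queue, (cycle, self._seq, pin, value, period))
        self._seq += 1

    def run(self, cycles):
        """Advance the controller one cycle at a time; commands fire
        exactly when the cycle counter matches their timestamp."""
        for now in range(cycles):
            while self.queue and self.queue[0][0] == now:
                _, _, pin, value, period = heapq.heappop(self.queue)
                self.pins[pin] = value
                self.trace.append((now, pin, value))
                if period:   # re-arm periodic commands
                    self.program(now + period, pin, value, period)
```

Because commands are released by the controller's own cycle counter, the issue time is independent of how long the command took to travel from the CPU through the interconnect, which is the property the paper is after.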
16:00  IP3-10, 125  A HARDWARE IMPLEMENTATION OF THE MCAS SYNCHRONIZATION PRIMITIVE
Speaker:
Smruti Sarangi, IIT Delhi, IN
Authors:
Srishty Patel, Rajshekar Kalayappan, Ishani Mahajan and Smruti R. Sarangi, IIT Delhi, IN
Abstract
Lock-based parallel programs are easy to write. However, they are inherently slow as the synchronization is blocking in nature. Non-blocking lock-free programs, which use atomic instructions such as compare-and-set (CAS), are significantly faster. However, lock-free programs are notoriously difficult to design and debug. This can be greatly eased if the primitives work on multiple memory locations instead of one. We propose MCAS, a hardware implementation of a multi-word compare-and-set primitive. Ease of programming aside, MCAS-based programs are 13.8X and 4X faster on average than lock-based and traditional lock-free programs, respectively. The area overhead, in a 32-core 400mm2 chip, is a mere 0.046%.

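The semantics of a multi-word compare-and-set are easy to state precisely: every word must still hold its expected value, or nothing is written. The sketch below models those semantics in software, with a lock standing in for the hardware atomicity the paper provides; it is not the proposed design.

```python
import threading

class MCASMemory:
    """Multi-word compare-and-set: atomically checks several words and,
    only if all match their expected values, writes all new values.
    A lock models the hardware atomicity described in the abstract
    (a software model of the primitive's semantics, not the design)."""
    def __init__(self, size):
        self.words = [0] * size
        self._lock = threading.Lock()

    def mcas(self, updates):
        """updates: list of (address, expected, new) triples.
        Returns True iff every word matched and all writes were applied."""
        with self._lock:
            if all(self.words[a] == exp for a, exp, _ in updates):
                for a, _, new in updates:
                    self.words[a] = new
                return True
            return False   # any mismatch leaves memory untouched
```

A single-word CAS loop cannot, for example, move a value between two lock-free queue slots in one atomic step; with MCAS both slots are checked and updated together or not at all.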
16:01  IP3-11, 325  BANDITS: DYNAMIC TIMING SPECULATION USING MULTI-ARMED BANDIT BASED OPTIMIZATION
Speaker:
Jeff Zhang, New York University, US
Authors:
Jeff Zhang and Siddharth Garg, New York University, US
Abstract
Timing speculation has recently been proposed as a method for increasing performance beyond what is achievable by conventional worst-case design techniques. Starting with the observation of fast temporal variations in timing error probabilities, we propose a run-time technique to dynamically determine the optimal degree of timing speculation (i.e., how aggressively the processor is over-clocked) based on a novel formulation of the dynamic timing speculation problem as a multi-armed bandit problem. In detailed post-synthesis timing simulations of a 5-stage MIPS processor running a variety of workloads, the proposed adaptive mechanism improves the processor's performance significantly compared with a competing approach (about 8.3% improvement), while showing only about a 2.8% performance loss on average compared with oracle results.

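The bandit formulation can be illustrated with a generic epsilon-greedy learner: each arm is a candidate over-clocking level, and the reward trades raw speed against the cost of recovering from timing errors. The reward model in the usage example is invented for illustration and is not the paper's formulation.

```python
import random

def tune_overclock(measure_reward, freq_levels, steps=500, eps=0.1):
    """Epsilon-greedy multi-armed bandit over candidate clock settings.
    measure_reward(f) returns the observed payoff of running at level f
    (generic bandit sketch, not the paper's exact algorithm)."""
    random.seed(1)
    q = {f: 0.0 for f in freq_levels}   # estimated reward per arm
    n = {f: 0 for f in freq_levels}     # pull count per arm
    for _ in range(steps):
        if random.random() < eps:       # explore a random setting
            f = random.choice(freq_levels)
        else:                           # exploit the best-known setting
            f = max(q, key=q.get)
        r = measure_reward(f)
        n[f] += 1
        q[f] += (r - q[f]) / n[f]       # incremental mean of observed reward
    return max(q, key=q.get)
```

With a toy reward where a moderate over-clock gives the best speed/error tradeoff (here 1.1x: slight errors, cheap recovery; 1.2x: error recovery wipes out the gain), the learner settles on the moderate level.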
16:02  IP3-12, 261  DESIGN AND IMPLEMENTATION OF A FAIR CREDIT-BASED BANDWIDTH SHARING SCHEME FOR BUSES
Speaker:
Carles Hernandez, Barcelona Supercomputing Center (BSC), ES
Authors:
Mladen Slijepcevic1, Carles Hernandez2, Jaume Abella3 and Francisco Cazorla4
1Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, ES; 2Barcelona Supercomputing Center, ES; 3Barcelona Supercomputing Center (BSC-CNS), ES; 4Barcelona Supercomputing Center and IIIA-CSIC, ES
Abstract
Fair arbitration in the access to hardware shared resources is fundamental to obtain low worst-case execution time (WCET) estimates in the context of critical real-time systems, for which performance guarantees are essential. Several hardware mechanisms exist for managing arbitration in those resources (buses, memory controllers, etc.). They typically attain fairness in terms of the number of slots each contender (e.g., core) gets granted access to the shared resource. However, those policies may lead to unfair bandwidth allocations for workloads with contenders issuing short requests and contenders issuing long requests. We propose a Credit-Based Arbitration (CBA) mechanism that achieves fairness in the cycles each core is granted access to the resource rather than in the number of granted slots. Furthermore, we implement CBA as part of a LEON3 4-core processor for the Space domain in an FPGA proving the feasibility and good performance characteristics of the design by comparing it against other arbitration schemes.

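The difference between slot-based and cycle-based fairness is easy to demonstrate: charge each contender the actual cycles its request occupies the bus, not one "slot" per grant. The model below is an illustrative simplification (the replenishment policy is an assumption), not the CBA hardware design.

```python
def credit_based_arbiter(requests, credit_per_round=8):
    """Grants the bus to the contender holding the most remaining cycle
    credits and charges the actual cycles each request occupies the bus,
    so cores issuing long requests cannot starve cores issuing short ones.

    requests: list of (core, length_in_cycles) in arrival order.
    Returns (grant order, cycles consumed per core)."""
    cores = {c for c, _ in requests}
    credits = {c: credit_per_round for c in cores}
    used = {c: 0 for c in cores}
    pending = list(requests)
    order = []
    while pending:
        # Grant the pending request whose core holds the most credits.
        i = max(range(len(pending)), key=lambda k: credits[pending[k][0]])
        core, length = pending.pop(i)
        order.append(core)
        credits[core] -= length   # charge actual bus occupancy, not one slot
        used[core] += length
        if all(credits[c] <= 0 for c in cores):   # replenish when exhausted
            for c in cores:
                credits[c] += credit_per_round
    return order, used
```

In the usage below, core 0 issues 6-cycle requests and core 1 issues 1-cycle requests; after core 0's first long transfer its credits drop, so all of core 1's short requests overtake core 0's second long one, which is the cycle-level fairness a slot-based round-robin would miss.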
16:00  End of session
Coffee Break in Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Tuesday, March 28, 2017

  • Coffee Break 10:30 - 11:30
  • Coffee Break 16:00 - 17:00

Wednesday, March 29, 2017

  • Coffee Break 10:00 - 11:00
  • Coffee Break 16:00 - 17:00

Thursday, March 30, 2017

  • Coffee Break 10:00 - 11:00
  • Coffee Break 15:30 - 16:00