11.4 Evaluating and optimizing memory and timing across HW and SW boundaries


Date: Thursday 22 March 2018
Time: 14:00 - 15:30
Location / Room: Konf. 2

Chair:
Sara Vinco, Politecnico di Torino, IT

Co-Chair:
Todd Austin, University of Michigan, US

The session presents new solutions for evaluating and optimizing data location and application timing. The first paper presents a hybrid memory emulator for analyzing the performance characteristics of non-volatile memories. The second paper proposes a tool and library for synthesizing distributed protocols in concurrent systems. The third paper presents an optimization framework to find the best data-partitioning schemes for processing-in-memory architectures. Finally, the last paper uses loop acceleration to advance source-level timing simulation.

Time | Label | Presentation Title / Authors
14:00 | 11.4.1 (Best Paper Award Candidate)
HME: A LIGHTWEIGHT EMULATOR FOR HYBRID MEMORY
Speaker:
Jie Xu, Huazhong University of Science and Technology, CN
Authors:
Zhuohui Duan, Haikun Liu, Xiaofei Liao and Hai Jin, Huazhong University of Science and Technology, CN
Abstract
Emerging non-volatile memory (NVM) technologies have been widely studied in recent years. Those studies mainly rely on cycle-accurate architecture simulators because commercial NVM hardware is still unavailable. However, current simulation approaches are either too slow or unable to simulate complex, large-scale workloads. In this paper, we propose a DRAM-based hybrid memory emulator, called HME, to emulate the performance characteristics of NVM devices. HME exploits hardware features available in commodity Non-Uniform Memory Access (NUMA) architectures to emulate two kinds of memory: fast, local DRAM, and slower, remote NVM on other NUMA nodes. HME can emulate a wide range of NVM latencies by injecting software-created memory access delays on the remote NUMA nodes. To evaluate the impact of hybrid memories on application performance, we also provide application programming interfaces to allocate memory from NVM or DRAM regions. We evaluate the accuracy of the read/write delay injection models using SPEC CPU2006 and compare the results with Quartz, a state-of-the-art NVM emulator. Experimental results demonstrate that the average emulation errors of NVM read and write latencies in HME are less than 5%, much lower than those of Quartz. Moreover, the application performance overhead of HME is one order of magnitude lower than that of Quartz.

Download Paper (PDF; Only available from the DATE venue WiFi)
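The delay-injection idea behind such emulators can be sketched as an epoch-based model: count the epoch's accesses that went to the emulated NVM node, then stall the core for the extra latency those accesses would have incurred on real NVM. The sketch below assumes a simple linear-latency model; the function names and constants are illustrative, not HME's actual implementation.

```python
def injected_delay_ns(nvm_node_accesses, target_nvm_latency_ns,
                      local_dram_latency_ns):
    """Extra delay to insert at the end of an epoch so that accesses to
    the remote NUMA node appear to have NVM latency rather than the
    remote-DRAM latency they actually had."""
    extra_per_access = target_nvm_latency_ns - local_dram_latency_ns
    return max(0, nvm_node_accesses * extra_per_access)

def end_of_epoch(accesses, nvm_ns=300, dram_ns=80):
    delay = injected_delay_ns(accesses, nvm_ns, dram_ns)
    # A real emulator would spin or sleep on the core for `delay` ns here.
    return delay
```

With 1,000 remote accesses in an epoch and an assumed 220 ns latency gap per access, the model injects 220 microseconds of delay; sweeping `nvm_ns` emulates a range of NVM devices.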
14:30 | 11.4.2
VERC3: A LIBRARY FOR EXPLICIT STATE SYNTHESIS OF CONCURRENT SYSTEMS
Speaker:
Marco Elver, University of Edinburgh, GB
Authors:
Marco Elver, Christopher J. Banks, Paul Jackson and Vijay Nagarajan, University of Edinburgh, GB
Abstract
We propose an alternative, explicit-state-only approach to concurrent system synthesis, focusing in particular on the synthesis of distributed protocols. Given a correctness specification and a protocol skeleton (i.e., a protocol that is incomplete, with holes), the goal is to synthesize the holes. At the heart of our technique is a dynamic-programming-based algorithm that prunes inferred failure candidates. The algorithm exploits the fact that typically only a few transitions are needed to reach an erroneous state in a faulty distributed protocol. It is therefore unlikely that every hole to be synthesized contributes to the error; faulty protocol candidates in which only a subset of the holes was used can thus be used to infer the failure of later candidates with a superset of those holes. We evaluate the tool using a cache coherence protocol synthesis case study. Specifically, we study a directory-based MSI protocol, assuming an unordered interconnect, which gives rise to numerous race conditions that must be resolved by introducing transient states, a common cause of complexity and bugs in such protocols. In the case study, we therefore focus on synthesizing the transient-state actions (we consider up to 12 holes out of a possible 35). With the proposed candidate-pruning optimization, we report up to a 43x improvement over a naive candidate enumeration scheme. We make the tool and C++ library, VerC3, available.

15:00 | 11.4.3
PROMETHEUS: PROCESSING-IN-MEMORY HETEROGENEOUS ARCHITECTURE DESIGN FROM A MULTI-LAYER NETWORK THEORETIC STRATEGY
Speaker:
Yao Xiao, USC, US
Authors:
Yao Xiao, Shahin Nazarian and Paul Bogdan, University of Southern California, US
Abstract
With increasing demand for distributed intelligent physical systems performing big data analytics in the field and in real time, processing-in-memory (PIM) architectures integrating 3D-stacked memory and logic layers could provide higher performance and energy efficiency. Towards this end, PIM design requires principled and rigorous optimization strategies to identify interactions and manage data movement across different vaults. In this paper, we introduce Prometheus, a novel PIM-based framework that constructs a comprehensive model of computation and communication (MoCC) based on static and dynamic compilation of an application. First, using the LLVM intermediate representation (IR), an input application is modeled as a two-layered graph consisting of (i) a computation layer, in which nodes denote computational IR instructions and edges denote data dependencies among instructions, and (ii) a communication layer, in which nodes denote memory operations (e.g., load/store) and edges represent memory dependencies detected by alias analysis. Second, we develop an optimization framework that partitions the multi-layer network into processing communities within which the computational workload is maximized while the load is balanced among computational clusters. Third, we propose a community-to-vault mapping algorithm for designing a scalable hybrid memory cube (HMC)-based system in which vaults are interconnected through a network-on-chip (NoC) rather than a crossbar architecture, ensuring scalability to hundreds of vaults in each cube. Experimental results demonstrate that Prometheus, with 64 HMC-based vaults, improves system performance by 9.8x and achieves a 2.3x energy reduction compared to conventional systems.

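The community-to-vault mapping step can be illustrated with a simple load-balancing heuristic: assign each community of instructions, in order of decreasing workload, to the currently least-loaded vault. This greedy longest-processing-time sketch and all names in it are our assumptions for illustration, not the paper's actual mapping algorithm.

```python
import heapq

def map_communities_to_vaults(community_loads, num_vaults):
    """Greedily balance community workloads across vaults.
    community_loads[i] is the computational load of community i.
    Returns (total_load, vault_id, community_ids) per vault."""
    heap = [(0, v, []) for v in range(num_vaults)]
    heapq.heapify(heap)
    for cid, load in sorted(enumerate(community_loads),
                            key=lambda x: -x[1]):
        total, v, members = heapq.heappop(heap)  # least-loaded vault
        members.append(cid)
        heapq.heappush(heap, (total + load, v, members))
    return sorted(heap)
```

For loads [5, 3, 3, 4, 1] on two vaults, the heuristic yields two vaults of total load 8 each, the balanced outcome the partitioning objective aims for.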
15:15 | 11.4.4
ADVANCING SOURCE-LEVEL TIMING SIMULATION USING LOOP ACCELERATION
Speaker:
Joscha Benz, University of Tuebingen, DE
Authors:
Joscha Benz1, Christoph Gerum1 and Oliver Bringmann2
1University of Tuebingen, DE; 2University of Tuebingen / FZI, DE
Abstract
Source-level timing simulation (SLTS) is an important technique for the early examination of timing behavior, as it is both fast and accurate. A factor occasionally more important than precision is simulation speed, especially during design space exploration or very early phases of development. Practices like rapid prototyping also benefit from high-performance timing simulation. We therefore propose to further reduce simulation run-time using a method called loop acceleration. Accelerating a loop in the context of SLTS means deriving the timing of the loop prior to simulation to increase the simulation speed of that loop. We integrated this technique into our SLTS framework and conducted a comprehensive evaluation using the Mälardalen benchmark suite. We were able to reduce simulation time by up to 43%, while the introduced accuracy loss did not exceed 8 percentage points.

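The loop-acceleration idea is easiest to see on a loop with a constant per-iteration cost: instead of advancing simulated time once per iteration, the loop's total timing is derived before simulation and applied in a single step. This is a minimal sketch under that constant-cost assumption; real SLTS must handle path-dependent body timing, and the function names are ours.

```python
def simulate_loop_naive(trip_count, iter_cycles):
    """Baseline source-level simulation: execute the timing
    annotation once per loop iteration."""
    t = 0
    for _ in range(trip_count):
        t += iter_cycles
    return t

def simulate_loop_accelerated(trip_count, iter_cycles):
    """Loop acceleration: the per-iteration timing is derived before
    simulation, so simulated time advances in one step."""
    return trip_count * iter_cycles
```

Both functions report the same simulated time, but the accelerated version does so in O(1) work per loop instead of O(trip_count), which is the source of the run-time reduction.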
15:30 | IP5-13, 782
IN-MEMORY COMPUTING USING PATHS-BASED LOGIC AND HETEROGENEOUS COMPONENTS
Speaker:
Alvaro Velasquez, University of Central Florida, US
Authors:
Alvaro Velasquez and Sumit Kumar Jha, University of Central Florida, US
Abstract
The memory-processor bottleneck and scaling difficulties of the CMOS transistor have given rise to a plethora of research initiatives to overcome these challenges. Popular among these is in-memory crossbar computing. In this paper, we propose a framework for synthesizing logic-in-memory circuits based on the behavior of paths of electric current throughout the memory. Limitations of using only bidirectional components with this approach are also established. We demonstrate the effectiveness of our approach by generating n-bit addition circuits that can compute using a constant number of read and write cycles.

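Paths-based logic can be modeled as reachability in a graph of switches: an output reads 1 exactly when a path of closed (conducting) switches connects its two terminals. The toy sketch below (our own model, not the paper's synthesis framework) realizes a two-input OR as two parallel switches between the same pair of nodes.

```python
from collections import deque

def path_exists(num_nodes, closed_switches, src, dst):
    """BFS over nodes joined by closed switches (undirected edges)."""
    adj = {i: [] for i in range(num_nodes)}
    for u, v in closed_switches:
        adj[u].append(v)
        adj[v].append(u)
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False

def or_gate(a, b):
    # Two parallel switches between node 0 and node 1, closed when the
    # corresponding input is 1; current flows iff a OR b.
    switches = []
    if a:
        switches.append((0, 1))
    if b:
        switches.append((0, 1))
    return path_exists(2, switches, 0, 1)
```

Wiring the two switches in series between three nodes instead would yield AND, which is how richer functions are composed from path constraints in this model.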
15:30 End of session
Coffee Break in Exhibition Area



Coffee Breaks in the Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area (Terrace Level of the ICCD).

Lunch Breaks (Großer Saal + Saal 1)

On all conference days (Tuesday to Thursday), a seated lunch (lunch buffet) will be offered in the rooms "Großer Saal" and "Saal 1" (Saal Level of the ICCD) to fully registered conference delegates only. There will be badge control at the entrance to the lunch break area.

Tuesday, March 20, 2018

  • Coffee Break 10:30 - 11:30
  • Lunch Break 13:00 - 14:30
  • Awards Presentation and Keynote Lecture in "Saal 2" 13:50 - 14:20
  • Coffee Break 16:00 - 17:00

Wednesday, March 21, 2018

  • Coffee Break 10:00 - 11:00
  • Lunch Break 12:30 - 14:30
  • Awards Presentation and Keynote Lecture in "Saal 2" 13:30 - 14:20
  • Coffee Break 16:00 - 17:00

Thursday, March 22, 2018

  • Coffee Break 10:00 - 11:00
  • Lunch Break 12:30 - 14:00
  • Keynote Lecture in "Saal 2" 13:20 - 13:50
  • Coffee Break 15:30 - 16:00