4.3 System Modelling for Simulation and Optimisation

Time

Label

Presentation Title
Authors

17:00

4.3.1

(Best Paper Award Candidate)
CAMP: ACCURATE MODELING OF CORE AND MEMORY LOCALITY FOR PROXY GENERATION OF BIG-DATA APPLICATIONS
Speaker:
Andreas Gerstlauer, University of Texas at Austin, US
Authors:
Reena Panda, Xinnian Zheng, Andreas Gerstlauer and Lizy John, The University of Texas at Austin, US
Abstract
Fast and accurate design-space exploration is a critical requirement for enabling future hardware designs. However, big-data applications are often complex targets to evaluate on early performance models (e.g., simulators or RTL models) owing to their complex software stacks, significantly long run times, system dependencies and the limited speed of performance models. To overcome the challenges in benchmarking complex big-data applications, in this paper, we propose a proxy generation methodology, CAMP, that can generate miniature proxy benchmarks, which are representative of the performance of big-data applications and yet converge to results quickly without needing any complex software stack support. Prior system-level proxy generation techniques model core locality features in detail, but abstract out memory locality modeling using simple stride-based models, which results in poor cloning accuracy for most applications. CAMP accurately models both core performance and memory locality, along with the feedback loop between the two. CAMP replicates core performance by modeling the dependencies between instructions, instruction types, control-flow behavior, etc. CAMP also adds a memory locality profiling approach that captures spatial and temporal locality of applications. Finally, we propose a novel proxy replay methodology that integrates the core and memory locality models to create accurate system-level proxy benchmarks. We demonstrate that CAMP proxies can mimic the original application's performance behavior and that they can capture the performance feedback loop well. For a variety of real-world big-data applications, we show that CAMP achieves an average cloning accuracy of 89%. We believe this is a new capability that can facilitate overall system (core and memory subsystem) design exploration.
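As a rough illustration of what "spatial and temporal locality" can mean for an address trace, the sketch below computes two textbook metrics: LRU reuse distances (temporal) and a histogram of address strides (spatial). It is not CAMP's profiler, which the abstract does not describe in detail. The function names, the 64-byte line size and the sample trace are all assumptions made for illustration.

```python
from collections import Counter


def reuse_distances(addresses, line_size=64):
    """Temporal locality: for each access, the number of distinct cache
    lines touched since the previous access to the same line.
    The first touch of a line yields None. line_size is an assumption."""
    stack = []  # LRU stack of line numbers, most recently used at the end
    distances = []
    for addr in addresses:
        line = addr // line_size
        if line in stack:
            idx = stack.index(line)
            # Lines above idx in the stack were touched since the last use.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)
        stack.append(line)
    return distances


def stride_histogram(addresses):
    """Spatial locality: histogram of deltas between consecutive addresses."""
    return Counter(b - a for a, b in zip(addresses, addresses[1:]))


# Hypothetical byte-address trace, invented for this example.
trace = [0, 64, 128, 0, 64, 4096, 0]
print(reuse_distances(trace))
print(stride_histogram(trace))
```

A pure stride model would keep only the second histogram. The reuse distances add the temporal information that determines whether a cache of a given size hits or misses.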

17:30

4.3.2

SMARTSHUTTLE: OPTIMIZING OFF-CHIP MEMORY ACCESSES FOR DEEP LEARNING ACCELERATORS
Speaker:
Guihai Yan, Institute of Computing Technology, Chinese Academy of Sciences, CN
Authors:
Jiajun Li, Guihai Yan, Wenyan Lu, Shuhao Jiang, Shijun Gong, Jingya Wu and Xiaowei Li, Institute of Computing Technology, Chinese Academy of Sciences, CN
Abstract
Convolutional Neural Network (CNN) accelerators are rapidly growing in popularity as a promising solution for deep-learning-based applications. Though optimizations on computation have been intensively studied, the energy efficiency of such accelerators remains limited by off-chip memory accesses, since their energy cost is orders of magnitude higher than that of other operations. Minimizing off-chip memory access volume, therefore, is the key to higher energy efficiency. However, there is a dilemma over which data type's accesses to minimize. We observed that sticking to minimizing the access of one data type cannot fit the varying shapes of convolutional layers in CNNs. To overcome this problem, this paper proposes an adaptive layer partitioning and scheduling scheme, called SmartShuttle, which can adaptively switch among the specific data-reuse-oriented scheduling schemes and the corresponding layer partitioning schemes to dynamically match different shapes of convolutional layers. Specifically, SmartShuttle takes both data reusability and sparsity into account, since they have a significant impact on the memory access volume. The experimental results show that SmartShuttle achieves 434.8 multiply-and-accumulate operations (MACs) per DRAM access for VGG-16 and 526.3 MACs per DRAM access for AlexNet, which outperforms the state-of-the-art approach (Eyeriss) by 52.2% and 52.6%, respectively.
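The "varying shapes" dilemma can be made concrete by counting data elements per layer. The sketch below is not SmartShuttle's cost model. It only compares the sizes of inputs, weights and outputs for an early and a late VGG-16 convolutional layer, assuming 3x3 kernels, stride 1 and same-size padding. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConvLayer:
    name: str
    in_ch: int   # input channels
    out_ch: int  # output channels
    height: int  # feature-map height (same for input and output here)
    width: int   # feature-map width
    k: int       # kernel size; stride 1 and 'same' padding are assumed

    def footprints(self):
        """Number of elements of each data type, each counted once."""
        return {
            "inputs": self.in_ch * self.height * self.width,
            "weights": self.k * self.k * self.in_ch * self.out_ch,
            "outputs": self.out_ch * self.height * self.width,
        }

    def macs(self):
        """Multiply-and-accumulate operations for the whole layer."""
        return self.k * self.k * self.in_ch * self.out_ch * self.height * self.width


def dominant(layer):
    """Data type with the largest footprint for this layer."""
    fp = layer.footprints()
    return max(fp, key=fp.get)


early = ConvLayer("conv1_1", in_ch=3, out_ch=64, height=224, width=224, k=3)
late = ConvLayer("conv5_1", in_ch=512, out_ch=512, height=14, width=14, k=3)

for layer in (early, late):
    print(layer.name, layer.footprints(), "dominant:", dominant(layer))
```

Under these assumptions the early layer is dominated by output feature maps and the late layer by weights, so a schedule that always keeps one data type on chip fits only one of the two.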

18:00

4.3.3

PORT CALL PATH SENSITIVE CONFLICT ANALYSIS FOR INSTANCE-AWARE PARALLEL SYSTEMC SIMULATION
Speaker:
Tim Schmidt, Student, US
Authors:
Tim Schmidt, Zhongqi Cheng and Rainer Doemer, University of California, Irvine, US
Abstract
Many parallel SystemC approaches expect a thread-safe and conflict-free model from the designer. Alternatively, an advanced compiler can identify and avoid possible parallel access conflicts. While manual conflict resolution can theoretically be more precise, it is impractical for real-world applications because of the inherent complexities. Here, automatic compiler-based analysis is preferred, as it provides conservative conflict avoidance with minimal false positives. This paper introduces a novel compiler technique called port call path analysis that greatly reduces the number of false-positive conflicts, resulting in significantly increased simulation speed. Experimental results show that the new analysis reduces the number of false conflicts by up to 98% and, on a 4-core processor, speeds up the simulation by up to 3x for a NoC particle simulator and 3.5x for a bitcoin miner SystemC model.

18:30

IP1-11, 142

BRIDGING DISCRETE AND CONTINUOUS TIME MODELS WITH ATOMS
Speaker:
George Ungureanu, KTH Royal Institute of Technology, SE
Authors:
George Ungureanu¹, José E. G. de Medeiros² and Ingo Sander¹
¹KTH Royal Institute of Technology, SE; ²University of Brasília, BR
Abstract
Recent trends in replacing traditionally digital components with analog counterparts in order to overcome physical limitations have led to an increasing need for rigorous modeling and simulation of hybrid systems. Combining the two domains under the same set of semantics is not straightforward and often leads to chaotic and non-deterministic behavior due to the lack of a common understanding of aspects concerning time. We propose an algebra of primitive interactions between continuous and discrete aspects of systems which enables their description within two orthogonal layers of computation. We show its benefits from the perspective of modeling and simulation, through the example of an RC oscillator modeled in a formal framework implementing this algebra.

18:31

IP1-12, 436

OHEX: OS-AWARE HYBRIDIZATION TECHNIQUES FOR ACCELERATING MPSOC FULL-SYSTEM SIMULATION
Speaker:
Róbert Lajos Bücs, Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, DE
Authors:
Róbert Lajos Bücs¹, Maximilian Fricke², Rainer Leupers¹, Gerd Ascheid¹, Stephan Tobies² and Andreas Hoffmann²
¹Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, DE; ²Synopsys GmbH, DE
Abstract
Virtual platform (VP) technology is an established enabler of embedded system design. However, the sheer number of CPU models in modern multi-core VPs forms a performance bottleneck. Hybrid simulation addresses this issue by executing parts of the embedded software stack on the host. Although the approach is significantly faster, hybridization cannot cope with higher software layers, e.g., Operating Systems (OSs). Thus, this paper presents the OS-aware Host EXtension (OHEX) framework to accelerate VPs while expanding the applicability of hybridization. OHEX is evaluated on various system layers, yielding speedups between 2.99x and 21.14x with specific benchmarks.

18:32

IP1-13, 144

A HIGHLY EFFICIENT FULL-SYSTEM VIRTUAL PROTOTYPE BASED ON VIRTUALIZATION-ASSISTED APPROACH
Speaker:
Hsin-I Wu, National Tsing Hua University, Department of Computer Science, Hsinchu, Taiwan, TW
Authors:
Hsin-I Wu, Chi-Kang Chen, Tsung-Ying Lu and Ren-Song Tsay, National Tsing Hua University, TW
Abstract
An effective full-system virtual prototype is critical for early-stage system design exploration. Generally, however, traditional acceleration approaches for virtual prototypes cannot accurately analyze system performance or model non-deterministic inter-component interactions, due to the unpredictability of simulation progress. In this paper, we propose an effective virtualization-assisted approach for modeling and performance analysis. First, we develop a deterministic synchronization process that manages the interactions affecting the data dependency in chronological order to model inter-component interactions consistently. Next, we create accurate timing and bus contention models based on runtime operation statistics for analyzing system performance. We implement the proposed virtualization-assisted approach on an off-the-shelf System-on-Chip (SoC) board to demonstrate the effectiveness of our idea. The experimental results show that the proposed approach runs 12 to 77 times faster than a commercial virtual prototyping tool and that its performance estimates deviate from real systems by only 3 to 6%.
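One generic way to process interactions "in chronological order" regardless of how fast each component happens to simulate is a timestamp-ordered queue. The sketch below shows that idea only. It is not the synchronization process of the paper. The class name, the component names and the events are hypothetical.

```python
import heapq
from collections import defaultdict


class ChronologicalSync:
    """Collects interactions from components and releases them ordered by
    simulated time, independent of the order in which they arrived."""

    def __init__(self):
        self._heap = []
        self._seq = defaultdict(int)  # per-source counter for stable ties

    def post(self, timestamp, source, payload):
        seq = self._seq[source]
        self._seq[source] += 1
        # Key: simulated time, then source name, then per-source order.
        heapq.heappush(self._heap, (timestamp, source, seq, payload))

    def drain(self, until):
        """Return all interactions with timestamp <= until, in order."""
        ordered = []
        while self._heap and self._heap[0][0] <= until:
            timestamp, source, _, payload = heapq.heappop(self._heap)
            ordered.append((timestamp, source, payload))
        return ordered


def run(arrival_order):
    sync = ChronologicalSync()
    for event in arrival_order:
        sync.post(*event)
    return sync.drain(until=100)


# The same three interactions, arriving in two different host-side orders.
a = [(30, "dma", "write"), (10, "cpu0", "read"), (20, "cpu1", "write")]
b = [(20, "cpu1", "write"), (30, "dma", "write"), (10, "cpu0", "read")]
print(run(a) == run(b))
```

Because ties are broken by source name and a per-source counter rather than by host arrival order, the released sequence stays the same across runs as long as each component posts its own events in a fixed order.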

18:30

End of session
Exhibition Reception in Exhibition Area
The Exhibition Reception will take place on Tuesday in the exhibition area, where free drinks for all conference delegates and exhibition visitors will be offered. All exhibitors are welcome to also provide drinks and snacks for the attendees.