11.5 Compile time and virtualization support for embedded system design

Date: Thursday 12 March 2020
Time: 14:00 - 15:30
Location / Room: Bayard

Chair:
Nicola Bombieri, Università di Verona, IT

Co-Chair:
Rodolfo Pellizzoni, University of Waterloo, CA

The papers in this session leverage compiler support and novel architectural features, such as virtualization extensions and emerging memory structures, to optimize the design flow of modern embedded systems.

Time | Label | Presentation Title / Authors
14:00 | 11.5.1 | UNIFIED THREAD- AND DATA-MAPPING FOR MULTI-THREADED MULTI-PHASE APPLICATIONS ON SPM MANY-CORES
Speaker:
Anuj Pathania, National University of Singapore, SG
Authors:
Vanchinathan Venkataramani, Anuj Pathania and Tulika Mitra, National University of Singapore, SG
Abstract
Scratchpad Memories (SPMs) are more scalable than caches, offering better performance with lower power and area overheads. This scalability makes them well suited as on-chip memory in many-cores. However, SPM many-cores delegate the responsibility of thread- and data-mapping to the software. The mapping is especially challenging for multi-threaded multi-phase applications, whose threads exhibit both inter- and intra-phase data-sharing patterns. These patterns intricately intertwine thread- and data-mapping across phases, and a high-quality mapping is the key to extracting application performance on SPM many-cores. The state-of-the-art framework for SPM many-cores performs thread- and data-mapping independently and, furthermore, can only operate on single-phase multi-threaded applications. In this work, we are the first to propose a unified thread- and data-mapping framework for NoC-based SPM many-cores executing multi-threaded multi-phase applications. Experimental evaluations show, on average, a 1.36x performance improvement over the state-of-the-art framework for multi-threaded multi-phase applications.
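
To make the joint mapping problem concrete, the sketch below scores a candidate mapping by weighting each thread-to-block sharing volume with the NoC hop distance it must travel, then places data blocks greedily. It is a toy illustration under assumed names and a Manhattan-distance cost model, with the thread mapping held fixed for brevity; it is not the authors' framework.

```cpp
// Toy greedy co-mapping of shared data blocks onto a 2D-mesh NoC of
// SPM tiles. Illustrative sketch only -- not the paper's algorithm.
// Cost of a placement = sharing volume x Manhattan hop distance.
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Share { int thread, block, volume; }; // thread<->block traffic in one phase

int hops(int a, int b, int meshW) {          // Manhattan distance between tiles
    return std::abs(a % meshW - b % meshW) + std::abs(a / meshW - b / meshW);
}

int main() {
    const int meshW = 2, tiles = 4;
    // One phase: thread t shares 'volume' words with data block b.
    std::vector<Share> phase = {{0, 0, 100}, {1, 0, 80}, {1, 1, 40}, {2, 1, 90}};
    std::vector<int> threadTile = {0, 1, 2}; // thread mapping fixed for brevity
    std::vector<int> blockTile(2, -1);

    // Greedily place each block on the tile minimizing weighted hop cost.
    for (int b = 0; b < 2; ++b) {
        int best = 0, bestCost = 1 << 30;
        for (int t = 0; t < tiles; ++t) {
            int cost = 0;
            for (const Share& s : phase)
                if (s.block == b)
                    cost += s.volume * hops(threadTile[s.thread], t, meshW);
            if (cost < bestCost) { bestCost = cost; best = t; }
        }
        blockTile[b] = best;
        std::printf("block %d -> tile %d (cost %d)\n", b, best, bestCost);
    }
}
```

A unified framework would additionally move the threads themselves, and redo this co-assignment per phase rather than once.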

14:30 | 11.5.2 | GENERALIZED DATA PLACEMENT STRATEGIES FOR RACETRACK MEMORIES
Speaker:
Asif Ali Khan, TU Dresden, DE
Authors:
Asif Ali Khan, Andres Goens, Fazal Hameed and Jeronimo Castrillon, TU Dresden, DE
Abstract
Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels of the memory hierarchy for improved performance and reduced energy consumption. However, the shift operations inherent to RTMs hinder their adoption as a replacement for low-latency on-chip memories. Recent research has demonstrated that intelligent placement of memory objects in RTMs can significantly reduce the number of shifts with no hardware overhead, albeit for specific system setups. Existing placement strategies may therefore lead to sub-optimal performance when applied to different architectures. In this paper we present generalized data-placement mechanisms that improve upon existing ones by taking into account the underlying memory architecture as well as the timing and liveness information of memory objects. We propose a novel heuristic and a genetic-algorithm formulation that optimize key performance parameters. We show that, on average, our generalized approach reduces the number of shifts by 4.3x and improves performance and energy consumption by 46% and 55%, respectively, compared to the state of the art.
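
The cost model underlying such placement strategies can be sketched directly: with a single access port per track, serving position p after position q costs |p - q| shifts, so a placement can be scored by replaying an access trace. The trace, the single-track model, and the hand-tuned placement below are illustrative assumptions, not the paper's heuristic or genetic formulation.

```cpp
// Shift-cost model for a single racetrack with one access port:
// serving the object at position p after position q costs |p - q|
// shifts. Minimal sketch -- real RTMs have multiple tracks and ports.
#include <cstdio>
#include <cstdlib>
#include <vector>

// Total shifts to replay an access trace under a given placement
// (placement[obj] = position of obj on the track).
long shiftCost(const std::vector<int>& trace, const std::vector<int>& placement) {
    long shifts = 0;
    int port = 0;                        // port starts aligned to position 0
    for (int obj : trace) {
        shifts += std::abs(placement[obj] - port);
        port = placement[obj];
    }
    return shifts;
}

int main() {
    std::vector<int> trace = {0, 2, 0, 2, 1, 0, 2}; // hot pair: objects 0 and 2
    std::vector<int> naive = {0, 1, 2};             // program-order placement
    std::vector<int> tuned = {0, 2, 1};             // co-locate objects 0 and 2
    std::printf("naive: %ld shifts, tuned: %ld shifts\n",
                shiftCost(trace, naive), shiftCost(trace, tuned));
}
```

Both the heuristic and the genetic-algorithm formulation can be seen as searching over such placements, guided by this kind of shift count plus architectural and liveness information.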

15:00 | 11.5.3 | ARM-ON-ARM: LEVERAGING VIRTUALIZATION EXTENSIONS FOR FAST VIRTUAL PLATFORMS
Speaker:
Lukas Jünger, RWTH Aachen University, DE
Authors:
Lukas Jünger¹, Jan Luca Malte Bölke², Stephan Tobies², Rainer Leupers¹ and Andreas Hoffmann²
¹RWTH Aachen University, DE; ²Synopsys GmbH, DE
Abstract
Virtual Platforms (VPs) are an essential enabling technology in the System-on-a-Chip (SoC) development cycle, used for early software development and hardware/software codesign. However, since virtual prototyping is limited by simulation performance, improving the simulation speed of VPs has been an active research topic for years. Different strategies have been proposed, such as fast instruction set simulation using Dynamic Binary Translation (DBT). But even fast simulators do not reach native execution speed. They do, however, allow executing rich Operating System (OS) kernels, which is typically infeasible when another OS is already running. Executing multiple OSs on shared physical hardware is usually accomplished via virtualization, which has a long history on x86 hardware. It enables encapsulated, native code execution on the host processor and has been used extensively in data centers, where many users share hardware resources. For embedded systems, virtualization support has become available only recently; on ARM processors, it was introduced with the ARM Virtualization Extensions of the ARMv7 architecture. Since virtualization allows native guest-code execution, near-native execution speeds can be reached. In this work we present a VP containing a novel ARMv8 SystemC Transaction Level Modeling 2.0 (TLM) compatible processor model. The model leverages the ARM Virtualization Extensions (VE) via the Linux Kernel-based Virtual Machine (KVM) to execute the target software natively on an ARMv8 host. To enable the integration of the processor model into a loosely-timed VP, we developed an accurate instruction-counting mechanism using the ARM Performance Monitors Extension (PMU). The requirements for integrating the processor model into a VP and the integration process itself are detailed in this work. Our evaluations show that our processor model achieves speedups of up to 2.57x over a state-of-the-art DBT-based simulator on ARMv8 hardware.
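
At its core, a KVM-backed processor model follows the standard Linux KVM control flow: open /dev/kvm, create a VM and a vCPU, mmap the shared kvm_run area, and service VM exits, forwarding MMIO accesses to the platform's peripheral models as TLM transactions. The fragment below is a minimal host-side sketch of that loop; guest memory setup, the ARM-specific KVM_ARM_VCPU_INIT step, error handling, and the actual SystemC/TLM coupling and PMU-based instruction counting are omitted.

```cpp
// Skeleton of a KVM-backed CPU model's run loop (Linux host).
// Sketch only: guest memory setup, KVM_ARM_VCPU_INIT, error handling
// and the SystemC/TLM coupling of the real model are omitted.
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <cstdio>

int main() {
    int kvm  = open("/dev/kvm", O_RDWR);
    int vm   = ioctl(kvm, KVM_CREATE_VM, 0);          // one VM per platform
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);         // one vCPU per core model
    // (ARM hosts: query KVM_ARM_PREFERRED_TARGET and issue
    //  KVM_ARM_VCPU_INIT here before the vCPU can run.)

    long sz = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);  // shared state area
    auto* run = static_cast<kvm_run*>(
        mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu, 0));

    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);                      // native guest execution
        if (run->exit_reason == KVM_EXIT_MMIO) {
            // Device access: forward as a TLM transaction to the VP's
            // peripheral models (address, data, is_write in run->mmio).
            std::printf("MMIO at 0x%llx\n",
                        (unsigned long long)run->mmio.phys_addr);
        } else {
            break;                                    // halt, error, etc.
        }
    }
}
```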

15:30 | IP5-8, 597 | TDO-CIM: TRANSPARENT DETECTION AND OFFLOADING FOR COMPUTATION IN-MEMORY
Speaker:
Lorenzo Chelini, Eindhoven University of Technology, NL
Authors:
Kanishkan Vadivel¹, Lorenzo Chelini², Ali BanaGozar¹, Gagandeep Singh², Stefano Corda², Roel Jordans¹ and Henk Corporaal¹
¹Eindhoven University of Technology, NL; ²IBM Research, CH
Abstract
Computation in-memory is a promising non-von Neumann approach that aims to eliminate data transfers to and from the memory subsystem. Although many architectures have been proposed, compiler support for them still lags behind. In this paper, we close this gap with an end-to-end compilation flow for in-memory computing based on the LLVM compiler infrastructure. Starting from sequential code, our approach automatically detects, optimizes, and offloads kernels suitable for in-memory acceleration. We demonstrate our compiler tool-flow on the PolyBench/C benchmark suite and evaluate the benefits of our proposed in-memory architecture, simulated in gem5, by comparing it with a state-of-the-art von Neumann architecture.
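
A typical shape for the detection half of such a flow is an LLVM pass that walks loops, flags offloadable kernels, and later rewrites them into runtime calls. The skeleton below uses LLVM's new pass manager; it is an illustrative sketch, and the candidate test, the pass name, and the mentioned cim_offload_gemm runtime symbol are assumptions rather than TDO-CIM's actual implementation.

```cpp
// Skeleton of an LLVM new-pass-manager function pass that flags
// innermost loops as potential in-memory-compute candidates.
// Illustrative sketch only; plugin registration and the real
// TDO-CIM detection/offloading logic are omitted.
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

struct CIMCandidatePass : PassInfoMixin<CIMCandidatePass> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM) {
    auto &LI = FAM.getResult<LoopAnalysis>(F);
    for (Loop *L : LI.getLoopsInPreorder()) {
      if (!L->getSubLoops().empty())   // consider only innermost loops
        continue;
      // A real pass would verify affine, side-effect-free accesses
      // here and replace the loop with a call into the device runtime
      // (e.g. a hypothetical cim_offload_gemm entry point).
      errs() << "CIM candidate loop in " << F.getName() << "\n";
    }
    return PreservedAnalyses::all();
  }
};
```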

15:33 | IP5-9, 799 | BACKFLOW: BACKWARD EDGE CONTROL FLOW ENFORCEMENT FOR LOW-END ARM MICROCONTROLLERS
Speaker:
Cyril Bresch, LCIS, FR
Authors:
Cyril Bresch¹, David Hély² and Roman Lysecky³
¹LCIS, FR; ²LCIS - Grenoble INP, FR; ³University of Arizona, US
Abstract
This paper presents BackFlow, a compiler-based toolchain that enforces indirect backward-edge control-flow integrity for low-end ARM Cortex-M microprocessors. BackFlow is implemented within the Clang/LLVM compiler and supports the ARM instruction set and its Thumb subset. The control-flow integrity instrumentation generated by the compiler relies on a bitmap in which each set bit indicates a valid pointer destination. The efficiency of the framework is benchmarked on an STM32 NUCLEO-F446RE microcontroller. The results show that the control-flow integrity solution incurs an execution-time overhead ranging from 1.5% to 4.5%.
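
The described bitmap check is cheap enough to inline before every indirect backward edge: map the return target to a bit index over the code region and trap when the bit is clear. The following self-contained sketch illustrates the idea; the 2-byte Thumb granularity, the flash base address, and the abort-on-violation handler are illustrative assumptions, not BackFlow's exact instrumentation.

```cpp
// Bitmap-based backward-edge check: one bit per 2-byte Thumb slot in
// the code region; a set bit marks a valid return target. Sketch only.
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Demo bitmap; in BackFlow this table is produced by the compiler.
static const uint8_t   cfi_bitmap[] = {0x05};    // slots 0 and 2 are valid
static const uintptr_t code_base = 0x08000000;   // typical STM32 flash base

// Check inlined before every indirect backward edge (function return).
static void cfi_check_return(uintptr_t target) {
    uintptr_t slot = (target - code_base) >> 1;  // 2-byte Thumb granularity
    if (!((cfi_bitmap[slot >> 3] >> (slot & 7)) & 1)) {
        std::puts("CFI violation: invalid return target");
        std::abort();                            // real systems: fault handler
    }
}

int main() {
    cfi_check_return(0x08000004);  // slot 2: valid, passes silently
    cfi_check_return(0x08000002);  // slot 1: invalid, aborts
}
```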

15:30 | End of session