8.7 Instruction-level and thread-level parallelism in embedded systems

Date: Wednesday 29 March 2017
Time: 17:00 - 18:30
Location / Room: 3B

Chair:
Oliver Bringmann, Universität Tübingen, DE

Co-Chair:
Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE

The first paper in this session presents a novel open-source hardware/software infrastructure for dynamic binary translation. The second paper presents a mechanism to improve floating-point to fixed-point conversion by exploiting word-level parallelism. The third paper presents a WCET analysis for multiple tasks on single-core systems.

Time  Label  Presentation Title / Authors
17:00  8.7.1  HARDWARE-ACCELERATED DYNAMIC BINARY TRANSLATION
Speaker:
Simon Rokicki, Université de Rennes 1 / IRISA, FR
Authors:
Simon Rokicki1, Erven Rohou2 and Steven Derrien1
1Irisa, FR; 2Inria, FR
Abstract
Dynamic Binary Translation (DBT) is often used in hardware/software co-design to take advantage of one architecture model while using binaries from another. Co-developing the DBT engine and the execution architecture leads to architectures with dedicated support for these mechanisms. In this work, we propose a hardware-accelerated Dynamic Binary Translation in which the first steps of the DBT process are fully accelerated in hardware. Results show that our hardware accelerators deliver a speed-up of 8x and an 18x lower energy cost compared with an equivalent software approach.
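
For readers unfamiliar with the software baseline, here is a minimal C sketch of a DBT dispatch loop; it is our own illustration, not the paper's engine, and the toy "guest" ISA, the translate() helper, and the tcache array are all hypothetical. Guest opcodes are translated to host functions on first use and cached, so later executions dispatch straight from the translation cache; the paper's accelerators move this first-pass translation work into hardware.

    #include <stdio.h>

    enum guest_op { G_INC, G_DBL, G_HALT };

    typedef int (*host_fn)(int);

    static int h_inc(int x) { return x + 1; }
    static int h_dbl(int x) { return x * 2; }

    /* Translation cache, indexed by guest opcode. */
    static host_fn tcache[G_HALT];

    /* Stand-in for binary translation: map a guest opcode to host code. */
    static host_fn translate(enum guest_op op) {
        return op == G_INC ? h_inc : h_dbl;
    }

    int main(void) {
        enum guest_op prog[] = { G_INC, G_DBL, G_INC, G_HALT };
        int acc = 3;
        for (int pc = 0; prog[pc] != G_HALT; pc++) {
            if (tcache[prog[pc]] == NULL)           /* miss: translate once */
                tcache[prog[pc]] = translate(prog[pc]);
            acc = tcache[prog[pc]](acc);            /* hit: reuse translation */
        }
        printf("result = %d\n", acc);               /* (3+1)*2 + 1 = 9 */
        return 0;
    }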

17:30  8.7.2  SUPERWORD LEVEL PARALLELISM AWARE WORD LENGTH OPTIMIZATION
Speaker:
Ali Hassan El Moussawi, IRISA, FR
Authors:
Ali Hassan El Moussawi1 and Steven Derrien2
1INRIA, FR; 2IRISA, FR
Abstract
Many embedded processors do not support floating-point arithmetic in order to comply with strict cost and power consumption constraints, but they generally provide SIMD support as a means to improve performance at little cost overhead. Achieving good performance when targeting such processors requires the use of fixed-point arithmetic and efficient exploitation of the SIMD datapath. To reduce time-to-market, automatic SIMDization -- such as superword level parallelism (SLP) extraction -- and floating-point to fixed-point conversion methodologies have been proposed. In this paper we show that applying these transformations independently is not efficient. We propose an SLP-aware word length optimization algorithm to jointly perform floating-point to fixed-point conversion and SLP extraction. We implement the proposed approach in a source-to-source compiler framework and evaluate it on several embedded processors. Experimental results illustrate the validity of our approach.
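
A minimal sketch of the fixed-point side of the problem, assuming the common Q15 format (16-bit words, 15 fractional bits); the macros and helper below are our own illustration, not the paper's framework. Narrowing operands to 16 bits is what lets SLP pack two of them into one 32-bit SIMD lane pair, so choosing word lengths without knowing how SLP will pack them (or vice versa) can leave performance on the table, which is the interaction the paper optimizes jointly.

    #include <stdint.h>
    #include <stdio.h>

    /* Q15 fixed point: 1 sign bit, 15 fractional bits, stored in int16_t. */
    #define FLOAT_TO_Q15(x) ((int16_t)((x) * 32768.0f))
    #define Q15_TO_FLOAT(q) ((q) / 32768.0f)

    /* Q15 multiply: widen to 32 bits, then shift the product back. */
    static int16_t q15_mul(int16_t a, int16_t b) {
        return (int16_t)(((int32_t)a * (int32_t)b) >> 15);
    }

    int main(void) {
        float fa = 0.5f, fb = 0.25f;
        int16_t qa = FLOAT_TO_Q15(fa), qb = FLOAT_TO_Q15(fb);
        printf("float result: %f\n", fa * fb);
        printf("fixed result: %f\n", Q15_TO_FLOAT(q15_mul(qa, qb)));
        return 0;
    }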

18:00  8.7.3  SCHEDULABILITY-AWARE SPM ALLOCATION FOR PREEMPTIVE HARD REAL-TIME SYSTEMS WITH ARBITRARY ACTIVATION PATTERNS
Speaker:
Arno Luppold, Hamburg University of Technology, DE
Authors:
Arno Luppold1 and Heiko Falk2
1Hamburg University of Technology, DE; 2Hamburg University of Technology (TUHH), DE
Abstract
In hard real-time multi-tasking systems, each task has to meet its deadline under any circumstances. If one or several tasks violate their timing constraints, compiler optimizations can be used to reduce the Worst-Case Execution Time (WCET) of each task with a focus on the system's schedulability. Existing approaches are limited to single-tasking or strictly periodic multi-tasking systems. We propose a compiler optimization that performs schedulability-aware static instruction scratchpad memory (SPM) allocation for arbitrary activation patterns and deadlines. The approach is based on Integer Linear Programming and is evaluated on the Infineon TriCore TC1796 microcontroller.
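
As background intuition only, schedulability-agnostic SPM allocation is often cast as a knapsack-style ILP; the generic formulation below is hypothetical, not the authors' model:

    \begin{align*}
    \min \quad & \sum_{t \in T} \mathrm{WCET}_t
      = \sum_{t \in T} \Big( C_t - \sum_{b \in B_t} g_{t,b}\, x_{t,b} \Big) \\
    \text{s.t.} \quad & \sum_{t \in T} \sum_{b \in B_t} s_{t,b}\, x_{t,b} \le S_{\mathrm{SPM}},
      \qquad x_{t,b} \in \{0, 1\},
    \end{align*}

where $x_{t,b}$ selects basic block $b$ of task $t$ for SPM placement, $C_t$ is the task's baseline WCET, $g_{t,b}$ the WCET gain of placing the block, $s_{t,b}$ its size, and $S_{\mathrm{SPM}}$ the SPM capacity. Per the abstract, the paper goes beyond such purely WCET-driven objectives by keeping the whole task set schedulable under arbitrary activation patterns.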

18:30  IP4-4, 636  SCHEDULE-AWARE LOOP PARALLELIZATION FOR EMBEDDED MPSOCS BY EXPLOITING PARALLEL SLACK
Speaker:
Miguel Angel Aguilar, RWTH Aachen University, DE
Authors:
Miguel Angel Aguilar1, Rainer Leupers1, Gerd Ascheid1, Nikolaos Kavvadias2 and Liam Fitzpatrick2
1RWTH Aachen University, DE; 2Silexica Software Solutions GmbH, DE
Abstract
MPSoC programming is still a challenging task, where several aspects have to be taken into account to achieve a profitable parallel execution. Selecting a proper scheduling policy is an aspect that has a major impact on performance. OpenMP is an example of a programming paradigm that allows the scheduling policy to be specified on a per-loop basis. However, choosing the best scheduling policy and the corresponding parameters is not a trivial task. In fact, there is already a large amount of software parallelized with OpenMP where the scheduling policy is not explicitly specified. The scheduling decision is then left to the default runtime, which in most cases does not yield the best performance. In this paper, we present a schedule-aware optimization approach enabled by exploiting the parallel slack existing in loops parallelized with OpenMP. Results on an embedded multicore device show that OpenMP loops optimized with our approach outperform the original OpenMP loops, where the scheduling policy is not specified, by up to 93%.
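
The knob being tuned here is OpenMP's schedule clause. The short C example below (our own, not from the paper) contrasts the unspecified default with an explicit policy; the chunk size of 1024 and the dummy loop bodies are arbitrary illustration values.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    static double a[N];

    int main(void) {
        /* Scheduling left to the default runtime: the common case the
           paper starts from. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = (double)i * 0.5;

        /* Explicit policy and chunk size: dynamic scheduling can balance
           iterations of uneven cost across cores. */
        #pragma omp parallel for schedule(dynamic, 1024)
        for (int i = 0; i < N; i++)
            a[i] += (i % 7) * 0.1;  /* stand-in for irregular work */

        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }

Which policy and chunk size win depends on the loop body and the core count, which is why the paper automates the choice rather than relying on the runtime default.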

18:31  IP4-5, 34  REDUCING CODE MANAGEMENT OVERHEAD IN SOFTWARE-MANAGED MULTICORES
Speaker:
Aviral Shrivastava, Arizona State University, US
Authors:
Jian Cai1, Yooseong Kim1, Youngbin Kim2, Aviral Shrivastava1 and Kyoungwoo Lee2
1Arizona State University, US; 2Yonsei University, KR
Abstract
Software-managed architectures, which use scratchpad memories (SPMs), are a promising alternative to cache-based architectures for multicores. SPMs provide scalability but require explicit management. For example, to use an instruction SPM, explicit management code needs to be inserted around every call site to load functions into the SPM. Such management code checks the state of the SPM and performs loading operations if necessary, which can cause considerable overhead at runtime. In this paper, we propose a compiler-based approach that reduces this overhead by identifying management code that can be removed or simplified. Our experiments with various benchmarks show that our approach reduces execution time by 14% on average. In addition, compared to hardware caching, using our approach on an SPM-based architecture can reduce the execution times of the benchmarks by up to 15%.
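
A runnable toy simulation of the management code in question; it is our sketch, not the paper's implementation, and the single-slot "SPM", the call_via_spm() wrapper, and the load counter are all hypothetical stand-ins. The guard-and-load pattern around each call is what the compiler analysis removes or simplifies when it can prove the callee is already resident.

    #include <stdio.h>

    /* The "SPM" holds at most one function at a time; management code
       inserted at each call site loads the callee only on a miss. */
    typedef void (*fn_t)(void);

    static fn_t spm_slot;        /* stands in for the instruction SPM */
    static int  load_count;      /* counts costly "DMA" loads */

    static void foo(void) { puts("foo runs from SPM"); }

    /* Management code a compiler would insert around each call site. */
    static void call_via_spm(fn_t f) {
        if (spm_slot != f) {     /* miss: load the function's code */
            spm_slot = f;
            load_count++;
        }
        spm_slot();              /* execute from the SPM */
    }

    int main(void) {
        call_via_spm(foo);
        call_via_spm(foo);       /* second call: the check is redundant */
        printf("loads performed: %d\n", load_count);
        return 0;
    }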

18:32  IP4-6, 18  PERFORMANCE EVALUATION AND OPTIMIZATION OF HBM-ENABLED GPU FOR DATA-INTENSIVE APPLICATIONS
Speaker:
Yuan Xie, University of California, Santa Barbara, US
Authors:
Maohua Zhu1, Youwei Zhuo2, Chao Wang3, Wenguang Chen4 and Yuan Xie1
1University of California, Santa Barbara, US; 2University of Southern California, US; 3University of Science and Technology of China, CN; 4Tsinghua University, CN
Abstract
Graphics Processing Units (GPUs) are widely used to accelerate data-intensive applications, and higher GPU memory bandwidth is desirable to improve their performance. Traditional GDDR memories achieve higher bandwidth by increasing frequency, which leads to excessive power consumption. Recently, a new memory technology called high-bandwidth memory (HBM), based on 3D die-stacking, has been adopted in the latest generation of GPUs; it provides both high bandwidth and low power consumption with in-package stacked DRAM. However, the capacity of the in-package stacked memory is limited (e.g., only 4GB for the state-of-the-art HBM-enabled GPU, the AMD Radeon Fury X). In this paper, we implement two representative data-intensive applications, convolutional neural network (CNN) and breadth-first search (BFS), on an HBM-enabled GPU to evaluate the improvement brought by the adoption of HBM, and we investigate techniques to fully unleash its benefits. Based on the evaluation results, we first propose a software pipeline to alleviate the HBM's capacity limitation for CNN. We then design two programming techniques to improve the utilization of memory bandwidth for BFS. Experimental results demonstrate that our pipelined CNN training achieves a 1.63x speedup on an HBM-enabled GPU compared with the best high-performance GPU on the market, and that the two optimization techniques make BFS up to 24.5x faster (9.8x and 2.5x for the two techniques, respectively) than conventional implementations.
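
The CNN-side software pipeline amounts to classic double buffering of a memory too small to hold everything at once. The plain-C sketch below is our own conceptual stand-in, not the authors' GPU code: transfer() and compute() are hypothetical placeholders, and on a real GPU the prefetch would be an asynchronous copy (e.g., issued in a separate stream) so that it genuinely overlaps the kernel working on batch b.

    #include <stdio.h>

    #define BATCHES   4
    #define BUF_WORDS 1024

    /* Two halves of the capacity-limited device memory. */
    static int buf[2][BUF_WORDS];

    /* Stand-in for a host-to-device transfer. */
    static void transfer(int batch, int *dst) {
        for (int i = 0; i < BUF_WORDS; i++)
            dst[i] = batch;
    }

    /* Stand-in for the compute kernel working on one batch. */
    static long compute(const int *src) {
        long sum = 0;
        for (int i = 0; i < BUF_WORDS; i++)
            sum += src[i];
        return sum;
    }

    int main(void) {
        transfer(0, buf[0]);                        /* prologue: fill first buffer */
        for (int b = 0; b < BATCHES; b++) {
            if (b + 1 < BATCHES)                    /* prefetch the next batch */
                transfer(b + 1, buf[(b + 1) % 2]);
            printf("batch %d -> %ld\n", b, compute(buf[b % 2]));
        }
        return 0;
    }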

18:30  End of session