8.7 Instruction-level and thread-level parallelism in embedded systems

Date: Wednesday 29 March 2017
Time: 17:00 - 18:30
Location / Room: 3B

Chair:
Oliver Bringmann, Universität Tübingen, DE

Co-Chair:
Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE

The first paper in this session presents a novel open-source hardware/software infrastructure for dynamic binary translation. The second paper presents a mechanism to improve floating-point to fixed-point conversion by exploiting word-level parallelism. The third paper presents a WCET analysis for multiple tasks on single-core systems.

Time  Label  Presentation Title / Authors
17:00  8.7.1  HARDWARE-ACCELERATED DYNAMIC BINARY TRANSLATION
Speaker:
Simon Rokicki, Université de Rennes 1 / IRISA, FR
Authors:
Simon Rokicki1, Erven Rohou2 and Steven Derrien1
1Irisa, FR; 2Inria, FR
Abstract
Dynamic Binary Translation (DBT) is often used in hardware/software co-design to take advantage of one architecture model while using binaries from another. Co-developing the DBT engine and the execution architecture leads to architectures with dedicated support for these mechanisms. In this work, we propose a hardware-accelerated Dynamic Binary Translation in which the first steps of the DBT process are fully accelerated in hardware. Results show that our hardware accelerators deliver a speed-up of 8x and an 18x lower energy cost compared with an equivalent software approach.
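
For readers unfamiliar with the software baseline, here is a minimal C sketch of a DBT dispatch loop; it is our own illustration, not the paper's engine, and the toy "guest" ISA, the translate() helper, and the tcache array are all hypothetical. Guest opcodes are translated to host functions on first use and cached, so later executions dispatch straight from the translation cache; the paper's accelerators move this first-pass translation work into hardware.

    #include <stdio.h>

    enum guest_op { G_INC, G_DBL, G_HALT };

    typedef int (*host_fn)(int);

    static int h_inc(int x) { return x + 1; }
    static int h_dbl(int x) { return x * 2; }

    /* Translation cache, indexed by guest opcode. */
    static host_fn tcache[G_HALT];

    /* Stand-in for binary translation: map a guest opcode to host code. */
    static host_fn translate(enum guest_op op) {
        return op == G_INC ? h_inc : h_dbl;
    }

    int main(void) {
        enum guest_op prog[] = { G_INC, G_DBL, G_INC, G_HALT };
        int acc = 3;
        for (int pc = 0; prog[pc] != G_HALT; pc++) {
            if (tcache[prog[pc]] == NULL)           /* miss: translate once */
                tcache[prog[pc]] = translate(prog[pc]);
            acc = tcache[prog[pc]](acc);            /* hit: reuse translation */
        }
        printf("result = %d\n", acc);               /* (3+1)*2 + 1 = 9 */
        return 0;
    }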

17:30  8.7.2  SUPERWORD LEVEL PARALLELISM AWARE WORD LENGTH OPTIMIZATION
Speaker:
Ali Hassan El Moussawi, IRISA, FR
Authors:
Ali Hassan El Moussawi1 and Steven Derrien2
1INRIA, FR; 2IRISA, FR
Abstract
Many embedded processors do not support floating-point arithmetic in order to comply with strict cost and power consumption constraints, but they generally provide SIMD support as a means to improve performance at little cost overhead. Achieving good performance when targeting such processors requires the use of fixed-point arithmetic and efficient exploitation of the SIMD datapath. To reduce time-to-market, automatic SIMDization -- such as superword level parallelism (SLP) extraction -- and floating-point to fixed-point conversion methodologies have been proposed. In this paper we show that applying these transformations independently is not efficient. We propose an SLP-aware word length optimization algorithm to jointly perform floating-point to fixed-point conversion and SLP extraction. We implement the proposed approach in a source-to-source compiler framework and evaluate it on several embedded processors. Experimental results illustrate the validity of our approach.
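
A minimal sketch of the fixed-point side of the problem, assuming the common Q15 format (16-bit words, 15 fractional bits); the macros and helper below are our own illustration, not the paper's framework. Narrowing operands to 16 bits is what lets SLP pack two of them into one 32-bit SIMD lane pair, so choosing word lengths without knowing how SLP will pack them (or vice versa) can leave performance on the table, which is the interaction the paper optimizes jointly.

    #include <stdint.h>
    #include <stdio.h>

    /* Q15 fixed point: 1 sign bit, 15 fractional bits, stored in int16_t. */
    #define FLOAT_TO_Q15(x) ((int16_t)((x) * 32768.0f))
    #define Q15_TO_FLOAT(q) ((q) / 32768.0f)

    /* Q15 multiply: widen to 32 bits, then shift the product back. */
    static int16_t q15_mul(int16_t a, int16_t b) {
        return (int16_t)(((int32_t)a * (int32_t)b) >> 15);
    }

    int main(void) {
        float fa = 0.5f, fb = 0.25f;
        int16_t qa = FLOAT_TO_Q15(fa), qb = FLOAT_TO_Q15(fb);
        printf("float result: %f\n", fa * fb);
        printf("fixed result: %f\n", Q15_TO_FLOAT(q15_mul(qa, qb)));
        return 0;
    }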

18:00  8.7.3  SCHEDULABILITY-AWARE SPM ALLOCATION FOR PREEMPTIVE HARD REAL-TIME SYSTEMS WITH ARBITRARY ACTIVATION PATTERNS
Speaker:
Arno Luppold, Hamburg University of Technology, DE
Authors:
Arno Luppold1 and Heiko Falk2
1Hamburg University of Technology, DE; 2Hamburg University of Technology (TUHH), DE
Abstract
In hard real-time multi-tasking systems, each task has to meet its deadline under any circumstances. If one or several tasks violate their timing constraints, compiler optimizations can be used to reduce the Worst-Case Execution Time (WCET) of each task with a focus on the system's schedulability. Existing approaches are limited to single-tasking or strictly periodic multi-tasking systems. We propose a compiler optimization that performs schedulability-aware static instruction scratchpad memory (SPM) allocation for arbitrary activation patterns and deadlines. The approach is based on Integer Linear Programming and is evaluated on the Infineon TriCore TC1796 microcontroller.
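
As background intuition only, schedulability-agnostic SPM allocation is often cast as a knapsack-style ILP; the generic formulation below is hypothetical, not the authors' model:

    \begin{align*}
    \min \quad & \sum_{t \in T} \mathrm{WCET}_t
      = \sum_{t \in T} \Big( C_t - \sum_{b \in B_t} g_{t,b}\, x_{t,b} \Big) \\
    \text{s.t.} \quad & \sum_{t \in T} \sum_{b \in B_t} s_{t,b}\, x_{t,b} \le S_{\mathrm{SPM}},
      \qquad x_{t,b} \in \{0, 1\},
    \end{align*}

where $x_{t,b}$ selects basic block $b$ of task $t$ for SPM placement, $C_t$ is the task's baseline WCET, $g_{t,b}$ the WCET gain of placing the block, $s_{t,b}$ its size, and $S_{\mathrm{SPM}}$ the SPM capacity. Per the abstract, the paper goes beyond such purely WCET-driven objectives by keeping the whole task set schedulable under arbitrary activation patterns.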

18:30  IP4-4, 636  SCHEDULE-AWARE LOOP PARALLELIZATION FOR EMBEDDED MPSOCS BY EXPLOITING PARALLEL SLACK
Speaker:
Miguel Angel Aguilar, RWTH Aachen University, DE
Authors:
Miguel Angel Aguilar1, Rainer Leupers1, Gerd Ascheid1, Nikolaos Kavvadias2 and Liam Fitzpatrick2
1RWTH Aachen University, DE; 2Silexica Software Solutions GmbH, DE
Abstract
MPSoC programming is still a challenging task, where several aspects have to be taken into account to achieve a profitable parallel execution. Selecting a proper scheduling policy is an aspect that has a major impact on performance. OpenMP is an example of a programming paradigm that allows the scheduling policy to be specified on a per-loop basis. However, choosing the best scheduling policy and the corresponding parameters is not a trivial task. In fact, there is already a large amount of software parallelized with OpenMP where the scheduling policy is not explicitly specified. The scheduling decision is then left to the default runtime, which in most cases does not yield the best performance. In this paper, we present a schedule-aware optimization approach enabled by exploiting the parallel slack existing in loops parallelized with OpenMP. Results on an embedded multicore device show that OpenMP loops optimized with our approach outperform the original OpenMP loops, where the scheduling policy is not specified, by up to 93%.
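
The knob being tuned here is OpenMP's schedule clause. The short C example below (our own, not from the paper) contrasts the unspecified default with an explicit policy; the chunk size of 1024 and the dummy loop bodies are arbitrary illustration values.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    static double a[N];

    int main(void) {
        /* Scheduling left to the default runtime: the common case the
           paper starts from. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = (double)i * 0.5;

        /* Explicit policy and chunk size: dynamic scheduling can balance
           iterations of uneven cost across cores. */
        #pragma omp parallel for schedule(dynamic, 1024)
        for (int i = 0; i < N; i++)
            a[i] += (i % 7) * 0.1;  /* stand-in for irregular work */

        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }

Which policy and chunk size win depends on the loop body and the core count, which is why the paper automates the choice rather than relying on the runtime default.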

18:31  IP4-5, 34  REDUCING CODE MANAGEMENT OVERHEAD IN SOFTWARE-MANAGED MULTICORES
Speaker:
Aviral Shrivastava, Arizona State University, US
Authors:
Jian Cai1, Yooseong Kim1, Youngbin Kim2, Aviral Shrivastava1 and Kyoungwoo Lee2
1Arizona State University, US; 2Yonsei University, KR
Abstract
Software-managed architectures, which use scratchpad memories (SPMs), are a promising alternative to cache-based architectures for multicores. SPMs provide scalability but require explicit management. For example, to use an instruction SPM, explicit management code needs to be inserted around every call site to load functions into the SPM. Such management code checks the state of the SPM and performs loading operations if necessary, which can cause considerable overhead at runtime. In this paper, we propose a compiler-based approach that reduces this overhead by identifying management code that can be removed or simplified. Our experiments with various benchmarks show that our approach reduces execution time by 14% on average. In addition, compared to hardware caching, using our approach on an SPM-based architecture can reduce the execution times of the benchmarks by up to 15%.
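
A runnable toy simulation of the management code in question; it is our sketch, not the paper's implementation, and the single-slot "SPM", the call_via_spm() wrapper, and the load counter are all hypothetical stand-ins. The guard-and-load pattern around each call is what the compiler analysis removes or simplifies when it can prove the callee is already resident.

    #include <stdio.h>

    /* The "SPM" holds at most one function at a time; management code
       inserted at each call site loads the callee only on a miss. */
    typedef void (*fn_t)(void);

    static fn_t spm_slot;        /* stands in for the instruction SPM */
    static int  load_count;      /* counts costly "DMA" loads */

    static void foo(void) { puts("foo runs from SPM"); }

    /* Management code a compiler would insert around each call site. */
    static void call_via_spm(fn_t f) {
        if (spm_slot != f) {     /* miss: load the function's code */
            spm_slot = f;
            load_count++;
        }
        spm_slot();              /* execute from the SPM */
    }

    int main(void) {
        call_via_spm(foo);
        call_via_spm(foo);       /* second call: the check is redundant */
        printf("loads performed: %d\n", load_count);
        return 0;
    }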

18:32  IP4-6, 18  PERFORMANCE EVALUATION AND OPTIMIZATION OF HBM-ENABLED GPU FOR DATA-INTENSIVE APPLICATIONS
Speaker:
Yuan Xie, University of California, Santa Barbara, US
Authors:
Maohua Zhu1, Youwei Zhuo2, Chao Wang3, Wenguang Chen4 and Yuan Xie1
1University of California, Santa Barbara, US; 2University of Southern California, US; 3University of Science and Technology of China, CN; 4Tsinghua University, CN
Abstract
Graphics Processing Units (GPUs) are widely used to accelerate data-intensive applications, and higher GPU memory bandwidth is desirable to improve their performance. Traditional GDDR memories achieve higher bandwidth by increasing frequency, which leads to excessive power consumption. Recently, a new memory technology called high-bandwidth memory (HBM), based on 3D die-stacking, has been adopted in the latest generation of GPUs; it provides both high bandwidth and low power consumption with in-package stacked DRAM. However, the capacity of the in-package stacked memory is limited (e.g., only 4GB for the state-of-the-art HBM-enabled GPU, the AMD Radeon Fury X). In this paper, we implement two representative data-intensive applications, convolutional neural network (CNN) and breadth-first search (BFS), on an HBM-enabled GPU to evaluate the improvement brought by the adoption of HBM, and we investigate techniques to fully unleash its benefits. Based on the evaluation results, we first propose a software pipeline to alleviate the HBM's capacity limitation for CNN. We then design two programming techniques to improve the utilization of memory bandwidth for BFS. Experimental results demonstrate that our pipelined CNN training achieves a 1.63x speedup on an HBM-enabled GPU compared with the best high-performance GPU on the market, and that the two optimization techniques make BFS up to 24.5x faster (9.8x and 2.5x for the two techniques, respectively) than conventional implementations.
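
The CNN-side software pipeline amounts to classic double buffering of a memory too small to hold everything at once. The plain-C sketch below is our own conceptual stand-in, not the authors' GPU code: transfer() and compute() are hypothetical placeholders, and on a real GPU the prefetch would be an asynchronous copy (e.g., issued in a separate stream) so that it genuinely overlaps the kernel working on batch b.

    #include <stdio.h>

    #define BATCHES   4
    #define BUF_WORDS 1024

    /* Two halves of the capacity-limited device memory. */
    static int buf[2][BUF_WORDS];

    /* Stand-in for a host-to-device transfer. */
    static void transfer(int batch, int *dst) {
        for (int i = 0; i < BUF_WORDS; i++)
            dst[i] = batch;
    }

    /* Stand-in for the compute kernel working on one batch. */
    static long compute(const int *src) {
        long sum = 0;
        for (int i = 0; i < BUF_WORDS; i++)
            sum += src[i];
        return sum;
    }

    int main(void) {
        transfer(0, buf[0]);                        /* prologue: fill first buffer */
        for (int b = 0; b < BATCHES; b++) {
            if (b + 1 < BATCHES)                    /* prefetch the next batch */
                transfer(b + 1, buf[(b + 1) % 2]);
            printf("batch %d -> %ld\n", b, compute(buf[b % 2]));
        }
        return 0;
    }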

18:30  End of session