7.5 Runtime support for multi/many cores


Time

Label

Presentation Title
Authors

14:30

7.5.1

RESOURCE-AWARE MAPREDUCE RUNTIME FOR MULTI/MANY-CORE ARCHITECTURES
Speaker:
Konstantinos Iliakis, MicroLab, ECE, NTUA, GR
Authors:
Konstantinos Iliakis¹, Sotirios Xydis¹ and Dimitrios Soudris²
¹National TU Athens, GR; ²National Technical University of Athens, GR
Abstract
Modern multi/many-core processors exhibit high integration densities, with dozens or even hundreds of cores. To ease the application development burden for such systems, various programming frameworks have emerged. The MapReduce programming model, after having demonstrated its usability in the area of distributed systems, has been adapted to the needs of shared-memory many-core and multi-processor systems, showing promising results in comparison with conventional multi-threaded libraries, e.g. pthreads. In this paper, we propose a novel resource-aware MapReduce architecture. The proposed runtime decouples the map and combine phases in order to increase the degree of parallelism, and it overlaps the memory-intensive combine operation with the compute-intensive map operation, resulting in better resource utilization and improved performance. A detailed sensitivity analysis of the framework's tuning knobs is provided. The decoupled MapReduce architecture is evaluated against the state-of-the-art library on two diverse systems, i.e. a Haswell server and a Xeon Phi co-processor, demonstrating average speedups of up to 2.2x and 2.9x, respectively.
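The decoupling of map and combine described in this abstract can be pictured as a producer/consumer pipeline. The following is a minimal sketch under stated assumptions, not the authors' runtime: the names `mapper`, `combiner`, and `word_count`, the queue size, and the word-count workload are all hypothetical, and Python threads only illustrate the structure (the interpreter lock prevents real parallel speedup).

```python
import queue
import threading
from collections import Counter

_DONE = object()  # sentinel: tells a combiner that no more pairs will arrive


def mapper(chunks, queues):
    """Compute-oriented stage: tokenize text and emit (word, 1) pairs."""
    n = len(queues)
    for chunk in chunks:
        for word in chunk.split():
            # Route by key so a given word always reaches the same combiner.
            queues[hash(word) % n].put((word, 1))


def combiner(q, partial):
    """Memory-oriented stage: fold emitted pairs into a private partial result."""
    while True:
        item = q.get()
        if item is _DONE:
            break
        key, value = item
        partial[key] += value


def word_count(chunks, num_mappers=2, num_combiners=2):
    # Bounded queues decouple the two stages while limiting buffered pairs.
    queues = [queue.Queue(maxsize=1024) for _ in range(num_combiners)]
    partials = [Counter() for _ in range(num_combiners)]

    combiners = [threading.Thread(target=combiner, args=(q, p))
                 for q, p in zip(queues, partials)]
    mappers = [threading.Thread(target=mapper, args=(chunks[i::num_mappers], queues))
               for i in range(num_mappers)]

    # Combiners start first, so combining overlaps with mapping.
    for t in combiners + mappers:
        t.start()
    for t in mappers:
        t.join()
    for q in queues:
        q.put(_DONE)
    for t in combiners:
        t.join()

    # Reduce: the partial results hold disjoint keys, so merging is a union.
    total = Counter()
    for p in partials:
        total.update(p)
    return total


counts = word_count(["a b a", "b c", "a"])
print(sorted(counts.items()))  # [('a', 3), ('b', 2), ('c', 1)]
```

In this sketch the numbers of mapper and combiner threads are independent parameters, which is the kind of tuning knob a sensitivity analysis would vary.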

15:00

7.5.2

TOWARDS A QUALIFIABLE OPENMP FRAMEWORK FOR EMBEDDED SYSTEMS
Speaker:
Adrián Munera Sánchez, BSC, ES
Authors:
Adrián Munera Sánchez, Sara Royuela and Eduardo Quiñones, BSC, ES
Abstract
OpenMP is a very convenient parallel programming model for developing critical real-time applications by virtue of its powerful tasking model and its proven time-predictable properties. However, current OpenMP implementations are not suitable for such systems because they make intensive use of dynamic memory to allocate the data structures needed to manage the parallel execution efficiently. This jeopardizes the qualification processes of critical real-time systems, which are needed to ensure that the integrated system stack, including the OpenMP framework, is compliant with the system requirements. This paper proposes a novel OpenMP framework that statically allocates all the data structures needed to execute the OpenMP tasking model. Our framework is composed of a compiler phase that captures the data environment of all the OpenMP tasks instantiated during the parallel execution, and a run-time phase implementing a lazy task creation policy that significantly reduces the memory requirements at run-time, whilst exploiting parallelism efficiently.
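The abstract does not spell out how the task structures are laid out or how the lazy creation policy decides when to instantiate a task. The sketch below therefore shows a different, generic technique that bounds task memory in a similar spirit: a fixed-capacity task pool with inline fallback. The class `StaticTaskPool` and its fields are hypothetical, and Python lists stand in for the statically sized arrays an embedded runtime would use.

```python
class StaticTaskPool:
    """Fixed-capacity task store; every slot is allocated once, up front."""

    def __init__(self, capacity):
        self.slots = [None] * capacity      # task descriptors, never resized
        self.free = list(range(capacity))   # indices of unused slots
        self.ready = []                     # indices of deferred tasks
        self.deferred = 0                   # tasks stored for later execution
        self.inlined = 0                    # tasks run directly by the spawner

    def spawn(self, fn, *args):
        if self.free:
            # A descriptor is available: record the task and defer it.
            idx = self.free.pop()
            self.slots[idx] = (fn, args)
            self.ready.append(idx)
            self.deferred += 1
        else:
            # Pool exhausted: run the task immediately instead of allocating.
            self.inlined += 1
            fn(*args)

    def run_all(self):
        while self.ready:
            idx = self.ready.pop()
            fn, args = self.slots[idx]
            # Release the slot before running, so the task can spawn children.
            self.slots[idx] = None
            self.free.append(idx)
            fn(*args)


results = []
pool = StaticTaskPool(capacity=2)
for i in range(5):
    pool.spawn(results.append, i * i)
pool.run_all()
print(sorted(results), pool.deferred, pool.inlined)  # [0, 1, 4, 9, 16] 2 3
```

The point of the sketch is that the memory footprint is fixed by `capacity` regardless of how many tasks the program spawns, at the cost of running some tasks without deferring them.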

15:30

7.5.3

ENERGY-EFFICIENT RUNTIME RESOURCE MANAGEMENT FOR ADAPTABLE MULTI-APPLICATION MAPPING
Speaker:
Robert Khasanov, TU Dresden, DE
Authors:
Robert Khasanov and Jeronimo Castrillon, TU Dresden, DE
Abstract
Modern embedded computing platforms contain a large number of heterogeneous resources, which allows multiple applications to execute on a single device. The number of applications running on the system varies with time, and so does the amount of available resources. This has considerably increased the complexity of analysis and optimization algorithms for runtime mapping of firm real-time applications. To reduce the runtime overhead, researchers have proposed to pre-compute partial mappings at compile time and have the runtime efficiently compute the final mapping. However, most existing solutions only compute a fixed mapping for a given set of running applications, and the mapping is defined for the entire duration of the workload execution. In this work we allow applications to adapt to the amount of available resources by using mapping segments. This way, applications may switch between different configurations with varying degrees of parallelism. We present a runtime manager for firm real-time applications that generates such mapping segments based on partial solutions and aims at minimizing the overall energy consumption without deadline violations. The proposed algorithm outperforms state-of-the-art approaches in overall energy consumption by up to 13% while incurring an order of magnitude less scheduling overhead.
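The idea of switching configurations mid-execution can be illustrated with a small worked calculation. This is a sketch under simplifying assumptions rather than the authors' algorithm: the `Config` type, the `run_segments` helper, and all numbers are hypothetical, and both progress and energy are assumed to scale linearly with the time spent in a configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    cores: int
    exec_time: float  # seconds to run the whole application in this configuration
    energy: float     # joules to run the whole application in this configuration


def run_segments(segments):
    """Return (finish_time, energy) for a list of (Config, duration) segments.

    A duration of None means "stay in this configuration until the work is done".
    """
    remaining = 1.0  # fraction of the application still to execute
    finish, energy = 0.0, 0.0
    for cfg, duration in segments:
        needed = remaining * cfg.exec_time
        spent = needed if duration is None else min(duration, needed)
        done = spent / cfg.exec_time
        finish += spent
        energy += done * cfg.energy
        remaining -= done
        if remaining <= 1e-12:
            return finish, energy
    raise ValueError("segments end before the application completes")


narrow = Config(cores=1, exec_time=8.0, energy=8.0)
wide = Config(cores=4, exec_time=4.0, energy=12.0)
deadline = 6.0

# Only one core is free for the first 4 s; afterwards four cores are available.
print(run_segments([(narrow, 4.0), (wide, None)]))  # (6.0, 10.0)
print(run_segments([(narrow, None)]))               # (8.0, 8.0), misses the deadline
print(run_segments([(wide, None)]))                 # (4.0, 12.0), needs 4 cores throughout
```

With these made-up numbers, the two-segment schedule meets the 6 s deadline using 10 J, whereas the fixed narrow mapping is too slow and the fixed wide mapping costs 12 J and requires four cores from the start.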

16:00

IP3-11, 619

ON THE TASK MAPPING AND SCHEDULING FOR DAG-BASED EMBEDDED VISION APPLICATIONS ON HETEROGENEOUS MULTI/MANY-CORE ARCHITECTURES
Speaker:
Nicola Bombieri, Università di Verona, IT
Authors:
Stefano Aldegheri¹, Nicola Bombieri¹ and Hiren Patel²
¹Università di Verona, IT; ²University of Waterloo, CA
Abstract
In this work, we show that applying the heterogeneous earliest finish time (HEFT) heuristic to the task scheduling of embedded vision applications can improve system performance by up to 70% with respect to state-of-the-art scheduling solutions. We propose an algorithm called exclusive earliest finish time (XEFT) that introduces the notion of exclusive overlap between application primitives to improve load balancing. We show that XEFT can improve system performance by up to 33% over HEFT, and by up to 82% over state-of-the-art approaches. We present results on different benchmarks, including a real-world localization and mapping application (ORB-SLAM) combined with the NVIDIA object detection application based on deep learning.
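For readers unfamiliar with HEFT, the following is a simplified sketch of the baseline heuristic only; XEFT and its exclusive-overlap notion are not reproduced here. The function `heft`, its argument layout, and the four-task example are hypothetical. The sketch assumes strictly positive task costs and, unlike the original HEFT formulation, appends each task after the last one on a processor instead of searching for idle gaps (no insertion policy).

```python
from functools import lru_cache


def heft(cost, succ, comm):
    """Simplified HEFT list scheduler.

    cost[t][p]   : run time of task t on processor p (assumed > 0)
    succ[t]      : list of successors of task t in the DAG
    comm[(u, v)] : transfer time from u to v when they run on different processors
    Returns {task: (processor, start, finish)}.
    """
    tasks = list(cost)
    num_procs = len(next(iter(cost.values())))
    pred = {t: [] for t in tasks}
    for u in tasks:
        for v in succ.get(u, []):
            pred[v].append(u)

    @lru_cache(maxsize=None)
    def rank(t):
        # Upward rank: average cost plus the longest path to an exit task.
        avg = sum(cost[t]) / num_procs
        tail = max((comm.get((t, s), 0) + rank(s) for s in succ.get(t, [])), default=0)
        return avg + tail

    # Decreasing upward rank places every task after all of its predecessors.
    order = sorted(tasks, key=rank, reverse=True)

    proc_free = [0.0] * num_procs
    placed = {}
    for t in order:
        best = None
        for p in range(num_procs):
            # Data from a predecessor on another processor pays the transfer time.
            ready = max((placed[u][2] + (0 if placed[u][0] == p else comm.get((u, t), 0))
                         for u in pred[t]), default=0.0)
            start = max(ready, proc_free[p])
            finish = start + cost[t][p]
            if best is None or finish < best[2]:
                best = (p, start, finish)
        placed[t] = best
        proc_free[best[0]] = best[2]
    return placed


# Diamond-shaped DAG on two processors with different per-task speeds.
cost = {"A": [2, 4], "B": [6, 2], "C": [3, 7], "D": [2, 2]}
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
comm = {("A", "B"): 1, ("A", "C"): 1, ("B", "D"): 1, ("C", "D"): 1}

schedule = heft(cost, succ, comm)
print(schedule["B"][0])                              # 1
print(max(f for _, _, f in schedule.values()))       # 8.0
```

In the example, task B is placed on the second processor because it finishes there earliest, even after paying the transfer time from A.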

16:00

End of session