Adaptive Resource Control in Multi-core Systems Alexei Iliasov, Ashur Rafiev, Alexander Romanovsky, Andrey Mokhov, Alex Yakovlev, and Fei Xia, Newcastle University, UK
Multi-core systems present a set of unique challenges and opportunities. In this paper we discuss the issues of power-proportional computing in a multi-core environment and argue that a cross-layer approach spanning from hardware to user-facing software is necessary to successfully address this problem.
Criticality-Aware Functionality Allocation for Distributed Multicore Real-Time Systems Junhe Gan, Paul Pop, and Jan Madsen, Technical University of Denmark, DK
We are interested in the implementation of mixed-criticality hard real-time applications on distributed architectures, composed of interconnected multicore processors, where each processing core is called a processing element (PE). The functionality of mixed-criticality hard real-time applications is captured in the early design stages using functional blocks of different Safety Integrity Levels (SILs). Before the applications are implemented, the functional blocks have to be decomposed into software tasks with SILs. Then, the software tasks have to be mapped and scheduled on the PEs of the architecture. We consider fixed-priority preemptive scheduling for tasks and non-preemptive scheduling for messages. We would like to determine the function-to-task decomposition, the type of PEs in the architecture and the mapping of tasks to the PEs, such that the total cost is minimized, the application is schedulable and the safety and security constraints are satisfied. The total cost captures the development and certification costs and the unit cost of the architecture. We propose a Genetic Algorithm-based approach to solve this two-objective optimization problem, and evaluate it using a real-life case study from the automotive industry.
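The shape of such a Genetic Algorithm-based search can be sketched as follows. This is a deliberately minimal toy, not the authors' implementation: the task counts, PE types, costs, execution times, and the capacity-based schedulability proxy are all invented for illustration.

```python
import random

random.seed(1)  # make this illustration deterministic

# All sizes, costs, and timing numbers below are illustrative assumptions,
# not data from the paper.
TASKS = 4                              # number of software tasks
NUM_PE = 3                             # candidate processing elements
PE_TYPES = {"cheap": 10, "fast": 30}   # unit cost per PE type
EXEC = {"cheap": 4, "fast": 2}         # WCET of any task on each PE type
CAPACITY = 8                           # schedulability proxy: max load per PE

def fitness(chrom):
    """Total architecture cost plus a large penalty for overloaded PEs."""
    mapping, types = chrom             # task -> PE index, PE index -> type
    load = [0] * NUM_PE
    for pe in mapping:
        load[pe] += EXEC[types[pe]]
    cost = sum(PE_TYPES[types[pe]] for pe in set(mapping))
    penalty = sum(1000 for l in load if l > CAPACITY)
    return cost + penalty

def random_chrom():
    return ([random.randrange(NUM_PE) for _ in range(TASKS)],
            [random.choice(list(PE_TYPES)) for _ in range(NUM_PE)])

def evolve(pop_size=20, gens=30):
    pop = [random_chrom() for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness)
        survivors = pop[:pop_size // 2]       # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(TASKS)     # one-point crossover on the mapping
            child = (a[0][:cut] + b[0][cut:], list(b[1]))
            if random.random() < 0.3:         # mutate one task's placement
                child[0][random.randrange(TASKS)] = random.randrange(NUM_PE)
            children.append(child)
        pop = survivors + children
    return min(pop, key=fitness)

best = evolve()
print("best cost:", fitness(best))
```

The penalty term folds the schedulability constraint into the cost objective; a real implementation would instead run a fixed-priority response-time analysis per PE.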
Estimating Video Decoding Energies And Processing Times Utilizing Virtual Hardware Sebastian Berschneider, Christian Herglotz, Marc Reichenbach, Dietmar Fey, and André Kaup, Friedrich-Alexander-University Erlangen-Nuremberg, DE
The market for embedded devices is growing steadily. Cell phones and smartphones in particular, which are essential tools for many people, are becoming ever more complex and nowadays serve as portable computers. A key problem for these devices is energy efficiency: the battery can be drained within a few hours, especially when a smartphone processes computationally intensive tasks such as video decoding. Modern devices therefore tend to include power-efficient processors. However, it is not only power-efficient hardware that affects the overall power consumption; designing algorithms with energy-efficient programming in mind is equally important. Usually, energy-efficient development is carried out on real hardware, where programs are executed and the power consumption is measured. This process is costly and error-prone, and requires expensive measurement equipment. Therefore, in this work we present a design methodology that runs the application software on virtual hardware (CPU) that counts instructions and memory accesses. By multiplying these counts by previously measured per-instruction energy and time values, energy and time can be estimated without running the target application on real hardware. As a result, we present a methodology for writing embedded applications with immediate feedback about these non-functional properties.
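The estimation step described above reduces to a weighted sum: the event counts reported by the virtual CPU are multiplied by per-event energy and time values measured once beforehand. A minimal sketch of that computation (the event classes, counts, and per-event costs are made-up numbers, not measurements from the paper):

```python
# Counts reported by the virtual hardware for one decoded frame (illustrative).
counts = {"alu": 1_200_000, "mul": 150_000, "mem_read": 300_000, "mem_write": 90_000}

# Per-event cost tables, measured once on the target (illustrative values):
# energy in nanojoules, time in nanoseconds per event.
energy_nj = {"alu": 0.5, "mul": 1.2, "mem_read": 3.0, "mem_write": 3.5}
time_ns   = {"alu": 1.0, "mul": 2.0, "mem_read": 8.0, "mem_write": 8.0}

def estimate(counts, cost_table):
    """Weighted sum: total = sum over event classes of count * per-event cost."""
    return sum(n * cost_table[ev] for ev, n in counts.items())

energy_mj = estimate(counts, energy_nj) * 1e-6   # nJ -> mJ
time_ms   = estimate(counts, time_ns) * 1e-6     # ns -> ms
print(f"estimated energy: {energy_mj:.2f} mJ, time: {time_ms:.2f} ms")
```

Because the per-event tables are measured once, new algorithm variants can be compared purely in simulation by re-running the counting step.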
Increased Reliability of Many-Core Platforms through Thermal Feedback Control Matthias Becker, Kristian Sandström, Moris Behnam, and Thomas Nolte, MRTC / Mälardalen University, SE
In this paper we present a low-overhead thermal management approach to increase the reliability of many-core embedded real-time systems. Each core is controlled by a feedback controller that adapts the core's utilization in order to decrease its dynamic power consumption and thus the corresponding heat generation. The control mechanism lets us migrate load in advance, before critical temperature values are reached, so migration can be performed safely with a guarantee that all deadlines are met.
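A per-core controller of this kind can be sketched as a simple proportional feedback loop: a measured temperature above a setpoint lowers the permitted utilization, which in turn lowers dynamic power and heat. The first-order thermal model, gains, and temperatures below are invented for illustration, not taken from the paper.

```python
# Proportional feedback control of per-core utilization (illustrative sketch).
T_AMBIENT = 40.0      # deg C
T_SETPOINT = 70.0     # deg C, kept below a critical temperature (e.g. 85)
K_P = 0.05            # controller gain (assumed)
HEAT_PER_UTIL = 50.0  # steady-state temperature rise at 100% utilization (assumed)
ALPHA = 0.2           # thermal inertia of the simple first-order model

def step_thermal(temp, util):
    """First-order thermal model: temperature moves toward ambient + load heat."""
    target = T_AMBIENT + HEAT_PER_UTIL * util
    return temp + ALPHA * (target - temp)

util, temp = 1.0, T_AMBIENT
for _ in range(200):
    error = T_SETPOINT - temp
    util = min(1.0, max(0.0, util + K_P * error))  # feedback: throttle when hot
    temp = step_thermal(temp, util)

print(f"steady state: util={util:.2f}, temp={temp:.1f} C")
```

The setpoint margin below the critical temperature is what buys time to migrate load safely; in the paper's setting the throttling decision would additionally have to respect task deadlines.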
Performance Analysis of a Computer Vision Application with the STHORM OpenCL SDK Vítor Schwambach, Sébastien Cleyet-Merle, Alain Issard, STMicroelectronics, FR and Stéphane Mancini, TIMA lab, FR
Computer vision applications constitute one of the key drivers for embedded many-core architectures. To enable parallel application performance estimation and optimization early in the development flow, the development environment must provide the developer with simulation tools for fast and precise application-level performance analysis. In this work, we port a face detection application onto the STHORM many-core accelerator using the STHORM OpenCL SDK. We compare performance results obtained with the STHORM cycle-approximate simulator against a prototype implementation, and show that a significant mismatch is present. We identify the key contributors to this mismatch, and propose that these be addressed in upcoming versions of the SDK to allow more precise simulation results for early design space exploration.
PSE - Performance Simulation Environment Jussi Hanhirova and Vesa Hirvisalo, Aalto University, FI
We use a resource-reservation-based simulation environment (PSE) as a research tool to experiment with co-modelling HW/SW schedulers. Our focus is on heterogeneous many-core systems. Task-processing-based systems use different load-balancing schemes to make efficient use of resources and to schedule work within real-time constraints. As parallel MPSoCs are constantly evolving, simulation is a viable tool to explore different configurations.
Scaling Performance of FFT Computation on an Industrial Integrated GPU Co-processor: Experiments with Algorithm Adaptation. Mohamed Amine Bergach and Serge Tissot, Kontron, FR, Michel Syska and Robert De Simone, Inria, FR
Recent Intel processors (IvyBridge, Haswell) contain an embedded on-chip GPU unit, in addition to the main CPU processor. In this work we consider the issue of efficiently mapping Fast Fourier Transform computation onto such coprocessor units. To achieve this we pursue three goals:
First, we want to study half-systematic ways to adjust the actual variant of the FFT algorithm, for a given size, to best fit the local memory capacity (the registers of a given GPU block) and perform computations without intermediate calls to distant memory;
Second, we want to study, by extensive experimentation, whether the remaining data transfers between memories (initial loads and final stores after each FFT computation) can be sustained by local interconnects at a speed matching the integrated GPU computations, or conversely if they have a negative impact on performance when computing FFTs on GPUs "at full blast";
Third, we want to record the energy consumption as observed in the previous experiments, and compare it to similar FFT implementations on the CPU side of the chip.
We report our work in this short paper and its companion poster, showing graphical results on a range of experiments. In broad terms, our findings are that GPUs can compute FFTs of a typical size faster than internal on-chip interconnects can provide them with data (by a factor of roughly 2), and that energy consumption is far smaller than on the CPU side.
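For reference, the kernel whose variants are being adapted is the classic Cooley-Tukey recursion; the variants studied above differ in radix and staging so that one transform fits the registers of a GPU block. A textbook radix-2 version, shown here only as an algorithmic baseline and not as the authors' GPU code, looks like:

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    # Twiddle factors exp(-2*pi*i*k/n) applied to the odd half.
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    # Butterfly: first half is even + tw, second half is even - tw.
    return ([even[k] + tw[k] for k in range(n // 2)] +
            [even[k] - tw[k] for k in range(n // 2)])

# A unit impulse transforms to a flat spectrum: four entries all equal to 1.
print(fft([1, 0, 0, 0]))
```

Each recursion level reads and writes the whole signal, which is why register and local-memory capacity per GPU block dictates how many butterfly stages can be fused before data must travel over the interconnect.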
Smart Scheduling of Streaming Applications via Timed Automata Waheed Ahmad, Robert de Groote, Philip K.F. Hölzenspies, Mariëlle Stoelinga, and Jaco van de Pol, University of Twente, NL
Streaming applications such as video-in-video and multi-video conferencing impose high demands on system performance. On the one hand, they require high system throughput. On the other hand, usage of the available resources must be kept to a minimum in order to save energy. Synchronous dataflow (SDF) graphs are a very popular computational model for analysing streaming applications, and have recently been widely used both in single-processor and multiprocessor contexts. Smart scheduling techniques are critical for system lifetime: the maximum throughput should be obtained while running as few resources as possible.
Current methods for computing the maximum throughput of SDF graphs either require an unbounded number of processors or assume static-order scheduling of tasks. Other methods involve converting an SDF graph into an equivalent Homogeneous SDF (HSDF) graph. This approach results in a larger graph; in the worst case, the converted HSDF graph can be exponentially larger.
This poster presents an alternative, novel approach to analysing SDF graphs on a given number of processors, using a well-established formalism for timed systems: Timed Automata (TA). TA are automata in which the elapse of time is measured by clock variables. The conditions under which a transition can be taken are indicated by clock guards, while invariants specify the conditions under which a system may stay in a given state. Synchronous communication between timed automata is carried out by hand-shake synchronisation using input and output actions; output and input actions are denoted with an exclamation mark and a question mark respectively, e.g. fire! and fire?. TA strike a good balance between expressiveness and tractability, and are supported by various verification tools, e.g. UPPAAL.
We translate the SDF graph of an application and a given architecture of processors into separate timed automata. The two automata synchronise using the actions "req" and "fire"; in this way, the timed automaton of the application SDF graph is mapped onto the timed automaton of the architecture model. We can then analyse performance using different measures of interest.
In particular, the main contributions of this poster are: (1) a compositional translation of SDF graphs into timed automata; (2) exploiting the capabilities of UPPAAL to search the whole state-space and to find the schedule that fits on the available processors and maximises the throughput; (3) finding the maximum throughput on homogeneous and heterogeneous platforms; (4) quantitative model-checking. We also demonstrate that deadlock freedom is preserved even if the number of processors varies.
Results show that in some cases the maximum throughput of an SDF graph remains the same even if the number of processors is reduced. Similarly, a trade-off between the given number of processors and the maximum throughput can be computed efficiently. Moreover, quantitative model-checking and verification of user-defined properties are available through different contemporary model-checkers.
Future work includes energy-optimal synthesis and scheduling, translation of SDF graphs to Energy Aware Automata, extension of SDF graphs with energy costs and stochastics, dynamic power management (DPM), and reduction techniques for energy models. In order to tackle state-space explosion, we also plan to apply multi-core LTL model checking.
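The underlying question — how the achievable throughput varies with the number of processors — can be illustrated outside the TA framework by list-scheduling one iteration of a toy actor network. This is a stand-in for intuition only, not the poster's timed-automata analysis; the task names, times, and dependencies are invented.

```python
def list_schedule(durations, preds, num_procs):
    """Greedy list scheduling: repeatedly start the ready task that can
    begin earliest on the earliest-free processor; returns the makespan."""
    finish, proc_free = {}, [0.0] * num_procs
    remaining = set(durations)
    while remaining:
        choices = []
        for t in remaining:
            if all(p in finish for p in preds.get(t, ())):
                ready_at = max((finish[p] for p in preds.get(t, ())), default=0.0)
                for i, free_at in enumerate(proc_free):
                    choices.append((max(ready_at, free_at), t, i))
        start, t, i = min(choices)
        finish[t] = start + durations[t]
        proc_free[i] = finish[t]
        remaining.remove(t)
    return max(finish.values())

# Toy actor network: C depends on A; B is independent (times are invented).
durations = {"A": 2, "B": 2, "C": 2}
preds = {"C": {"A"}}

for p in (1, 2, 3):
    print(p, "processor(s): makespan", list_schedule(durations, preds, p))
```

In this example the makespan with two processors equals the makespan with three, mirroring the observation above that throughput can remain unchanged when processors are removed; the exhaustive state-space search of a model-checker finds such trade-offs with a guarantee of optimality that a greedy heuristic cannot give.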
System Level Design Framework for Many-core Architectures Pablo Peñil, Luis Diaz, and Pablo Sanchez, University of Cantabria, ES
Embedded many-core architectures have constantly increased their shipment volume in recent years, providing a solution for creating highly optimized complex systems. In order to deal with the complexity of these many-core architectures, users require new design methodologies that encompass system specification and performance analysis from the initial stages of the design process. Performance-analysis frameworks should co-simulate the software application and the many-core hardware platform in order to obtain estimations of the software execution time and the performance of platform HW resources. This paper presents a fully integrated host-compiled simulation framework that yields fast performance estimations for high-level system models. This framework could be integrated into a design exploration methodology for choosing the optimal specification and software parallelization, facilitating system implementation and minimizing designer effort.
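The core idea of host-compiled simulation — running the application natively on the host while advancing a virtual clock with back-annotated per-block delays — can be sketched as follows. The API, annotation points, and cycle costs here are hypothetical, chosen only to show the mechanism.

```python
class HostSim:
    """Minimal host-compiled timing model: application code runs natively on
    the host, and each annotated code block advances a virtual clock by a
    pre-estimated delay for the target platform (values are invented)."""
    def __init__(self):
        self.cycles = 0

    def annotate(self, cost):
        """Charge 'cost' target cycles for the block that just executed."""
        self.cycles += cost

sim = HostSim()

def saxpy(a, x, y):
    out = []
    for xi, yi in zip(x, y):      # runs at native host speed...
        out.append(a * xi + yi)
        sim.annotate(3)           # ...while charging ~3 target cycles/element
    return out

result = saxpy(2.0, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(result, sim.cycles)         # functional result plus a timing estimate
```

Because the functional behaviour and the timing estimate are obtained in one native run, this kind of model is orders of magnitude faster than instruction-set simulation, at the price of coarser-grained annotations.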