9.2 Hot Topic - Transparent Use of Accelerators in Heterogeneous Computing Systems

Date: Thursday 12 March 2015
Time: 08:30 - 10:00
Location / Room: Belle Etoile

Organisers:
Christian Plessl, University of Paderborn, DE
Heiner Giefers, IBM Research Zurich, CH

Chair:
Christian Plessl, University of Paderborn, DE

Co-Chair:
Heiner Giefers, IBM Research Zurich, CH

This hot topic session discusses recent research on the transparent compilation and offloading of computational hotspots from CPUs to accelerators, in particular many-core processors and FPGAs. The overarching objective of these approaches is to make the performance and energy-efficiency benefits of heterogeneous computing available to a broader spectrum of applications and users by reducing, or even eliminating, the effort of porting applications.

Time   Label   Presentation Title / Authors
08:30   9.2.1   TRANSPARENT ACCELERATION OF PROGRAM EXECUTION USING RECONFIGURABLE HARDWARE
Speakers:
Nuno Paulino (1), João Canas Ferreira (1), João Bispo (2) and João M. P. Cardoso (2)
(1) INESC TEC and Faculty of Engineering, PT; (2) University of Porto, PT
Abstract
Accelerating applications that run on a general-purpose processor (GPP) by mapping parts of their execution to reconfigurable hardware is an approach that does not involve the program's source code and still ensures program portability across different target reconfigurable fabrics. However, the problem is very challenging, as suitable sequences of GPP instructions need to be translated and mapped to hardware, possibly at runtime. Thus, all mapping steps, from compiler analysis and optimizations to hardware generation, need to be both efficient and fast. This paper introduces some of the most representative approaches for binary acceleration using reconfigurable hardware, and presents our binary acceleration approach and the latest results. Our approach extends a GPP with a Reconfigurable Processing Unit (RPU), with both sharing the data memory. Repeating sequences of GPP instructions are migrated to an RPU that is composed of functional units and interconnect resources and is able to exploit instruction-level parallelism, e.g., via loop pipelining. Although we envision a fully dynamic system, currently the RPU resources are selected and organized offline using execution trace information. We present implementation prototypes of the system on a Spartan-6 FPGA with a MicroBlaze as GPP, and the very encouraging results achieved with a number of benchmarks.
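
As a rough illustration of what "repeating sequences of GPP instructions" in an execution trace can mean, the following Python sketch counts fixed-length windows of instruction addresses and reports the most frequent ones. This is not the authors' detection method; the function name `hot_sequences`, the sliding-window counting, and the example trace are all assumptions made for illustration.

```python
from collections import Counter


def hot_sequences(trace, length, min_count=2):
    """Count every window of `length` consecutive addresses in `trace`
    and return those seen at least `min_count` times, most frequent first.

    Naive sliding-window counting, chosen for clarity (an assumption,
    not the technique used in the paper).
    """
    windows = Counter(
        tuple(trace[i:i + length]) for i in range(len(trace) - length + 1)
    )
    hot = [(seq, n) for seq, n in windows.items() if n >= min_count]
    # Sort by descending count, then by address for a stable order.
    return sorted(hot, key=lambda item: (-item[1], item[0]))


# Hypothetical trace: a 3-instruction loop body at 0x10..0x18 executed 4 times.
trace = [0x00, 0x04] + [0x10, 0x14, 0x18] * 4 + [0x20]
best_seq, best_count = hot_sequences(trace, length=3)[0]
```

In this made-up trace, the loop body is the most frequent window and would be the natural candidate for migration to an accelerator.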

08:52   9.2.2   ACCELERATING ARITHMETIC KERNELS WITH COHERENT ATTACHED FPGA COPROCESSORS
Speakers:
Heiner Giefers, Raphael Polig and Christoph Hagleitner, IBM Research Zurich, CH
Abstract
The energy efficiency of computer systems can be increased by migrating computational kernels that are known to under-utilize the CPU to an FPGA-based coprocessor. In contrast to traditional I/O-based coprocessors that require explicit data movement, coherently attached accelerators can operate on the same virtual address space as the host CPU. A shared memory organization enables widely accepted programming models and helps to deploy energy-efficient accelerators in general-purpose computing systems. In this paper we study an FFT accelerator on an FPGA attached via the Coherent Accelerator Processor Interface (CAPI) to a POWER8 processor. Our results show that the coherently attached accelerator outperforms device-driver-based approaches in terms of latency. Hardware acceleration delivers a 5x gain in energy efficiency compared to an optimized parallel software FFT running on a 12-core CPU and improves single-thread performance by more than 2x. We conclude that the integration of CAPI into heterogeneous programming frameworks such as OpenCL will facilitate latency-critical operations and will further enhance the programmability of hybrid systems.

09:15   9.2.3   TRANSPARENT OFFLOADING OF COMPUTATIONAL HOTSPOTS FROM BINARY CODE TO XEON PHI
Speakers:
Marvin Damschen (1), Heinrich Riebler (2), Gavin Vaz (2) and Christian Plessl (2)
(1) Karlsruhe Institute of Technology (KIT), DE; (2) University of Paderborn, DE
Abstract
In this paper, we study how binary applications can be transparently accelerated with novel heterogeneous computing resources without requiring any manual porting or developer-provided hints. Our work is based on Binary Acceleration At Runtime (BAAR), our previously introduced binary acceleration mechanism that uses the LLVM Compiler Infrastructure. BAAR is designed as a client-server architecture. The client runs the program to be accelerated in an environment that allows program analysis and profiling, and it identifies and extracts suitable program parts to be offloaded. The server compiles and optimizes these offloaded program parts for the accelerator and offers the client access to these functions through a remote procedure call (RPC) interface. Our previous work proved the feasibility of our approach, but also showed that communication time and overheads limit the granularity of functions that can be meaningfully offloaded. In this work, we motivate the importance of lightweight, high-performance communication between server and client and present a communication mechanism based on the Message Passing Interface (MPI). We evaluate our approach by using an Intel Xeon Phi 5110P as the acceleration target and show that the communication overhead can be reduced from 40% to 10%, thus enabling even small hotspots to benefit from offloading to an accelerator.
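
To see why a lower communication overhead lets smaller hotspots benefit, consider the following back-of-the-envelope Python sketch. It assumes, purely for illustration, that "overhead" means the fraction of total offloaded time spent on communication; the abstract does not define the term, and the function names and timings below are hypothetical, not measurements from the paper.

```python
def offloaded_time(accel_compute_s, comm_fraction):
    """Total offloaded time when communication takes `comm_fraction`
    of that total (an assumed definition of 'overhead')."""
    if not 0.0 <= comm_fraction < 1.0:
        raise ValueError("comm_fraction must be in [0, 1)")
    return accel_compute_s / (1.0 - comm_fraction)


def worth_offloading(host_s, accel_compute_s, comm_fraction):
    """True if running on the accelerator, including communication,
    is faster than running on the host."""
    return offloaded_time(accel_compute_s, comm_fraction) < host_s


# Hypothetical hotspot: 1.0 ms on the host, 0.7 ms of pure compute
# on the accelerator.
slow_link = worth_offloading(1.0e-3, 0.7e-3, 0.40)  # 40% overhead
fast_link = worth_offloading(1.0e-3, 0.7e-3, 0.10)  # 10% overhead
```

Under this model the accelerator's compute speedup must exceed 1 / (1 - overhead) to pay off: roughly 1.67x at 40% overhead but only about 1.11x at 10%, so the same hypothetical hotspot is not worth offloading over the slow link and is worth offloading over the fast one.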

09:37   9.2.4   TRANSPARENT LINKING OF COMPILED SOFTWARE AND SYNTHESIZED HARDWARE
Speakers:
David Thomas (1), Shane T. Fleming (1), George A. Constantinides (1) and Dan R. Ghica (2)
(1) Imperial College London, GB; (2) University of Birmingham, GB
Abstract
Modern heterogeneous devices contain tightly coupled CPU and FPGA logic, allowing low-latency access to accelerators. However, system designers need to treat accelerated functions specially, with device-specific code for instantiating, configuring, and executing accelerators. We present a system-level linker that allows functions in hardware and software to be linked together to create heterogeneous systems. The linker works with post-compilation and post-synthesis components, allowing the designer to transparently move functions between devices simply by linking in either hardware or software object files. The linker places no special emphasis on the software, allowing computation to be initiated from within hardware, with function calls to software to provide services such as file access. A strong type system ensures that individual code artifacts can be written using the conventions of their own domain (C, HLS, VHDL), while allowing direct and transparent linking.
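
The idea of swapping a function between software and hardware by relinking can be pictured with a toy symbol table. The Python sketch below is only an analogy: the class `ToyLinker`, its methods, and the `square_sum` symbol are hypothetical, the "hardware" implementation is an ordinary Python function standing in for a synthesized block, and the signature check is a much-simplified stand-in for the type system mentioned in the abstract.

```python
from typing import Callable, Dict, Tuple

# (argument types, return type); a deliberately simple notion of "type".
Signature = Tuple[Tuple[type, ...], type]


class ToyLinker:
    """Resolves symbols to implementations regardless of their domain."""

    def __init__(self) -> None:
        self._symbols: Dict[str, Tuple[str, Signature, Callable]] = {}

    def link(self, name: str, domain: str, signature: Signature,
             impl: Callable) -> None:
        # Relinking a symbol is allowed only if its declared type is unchanged.
        existing = self._symbols.get(name)
        if existing is not None and existing[1] != signature:
            raise TypeError(f"signature mismatch for {name!r}")
        self._symbols[name] = (domain, signature, impl)

    def domain_of(self, name: str) -> str:
        return self._symbols[name][0]

    def call(self, name: str, *args):
        # Callers use the same interface whichever domain provides the symbol.
        _domain, (arg_types, _ret), impl = self._symbols[name]
        if len(args) != len(arg_types) or not all(
            isinstance(a, t) for a, t in zip(args, arg_types)
        ):
            raise TypeError(f"bad arguments for {name!r}")
        return impl(*args)


linker = ToyLinker()
sig = ((int, int), int)
linker.link("square_sum", "software", sig, lambda a, b: a * a + b * b)
sw_result = linker.call("square_sum", 3, 4)

# Relink the same symbol to a stand-in for a hardware implementation;
# the call site does not change.
linker.link("square_sum", "hardware", sig, lambda a, b: a * a + b * b)
hw_result = linker.call("square_sum", 3, 4)
```

The call site is identical before and after relinking, which is the property the abstract describes as transparent linking.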

10:00   End of session
Coffee Break in Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Lunch Break

On Tuesday and Wednesday, lunch boxes will be served in front of the session room Salle Oisans and in the exhibition area for fully registered delegates (a voucher will be given upon registration on-site). On Thursday, lunch will be served in Room Les Ecrins (for fully registered conference delegates only).

Tuesday, March 10, 2015

Coffee Break 10:30 - 11:30

Lunch Break 13:00 - 14:30; Keynote session from 13:20 - 14:20 (Room Oisans) sponsored by Mentor Graphics

Coffee Break 16:00 - 17:00

Wednesday, March 11, 2015

Coffee Break 10:00 - 11:00

Lunch Break 12:30 - 14:30, Keynote lectures from 12:50 - 14:20 (Room Oisans)

Coffee Break 16:00 - 17:00

Thursday, March 12, 2015

Coffee Break 10:00 - 11:00

Lunch Break 12:30 - 14:00, Keynote lecture from 13:20 - 13:50

Coffee Break 15:30 - 16:00