6.1 Special Day on "Embedded Meets Hyperscale and HPC" Session: Near-memory computing


Date: Wednesday 27 March 2019
Time: 11:00 - 12:30
Location / Room: Room 1

Chair:
Christoph Hagleitner, IBM Research, CH

Co-Chair:
Christian Plessl, Paderborn University, DE

While it used to be easy to increase the peak computational capabilities of processors by exploiting the growth in available transistors delivered by Moore's law, the latency and bandwidth of the memory system did not improve at the same pace. Today's microprocessors hide this fact behind a complex memory hierarchy, but often fail to optimally utilize the available memory bandwidth across a broad range of applications. Near-memory computing takes a fresh look at the memory system and proposes innovations ranging from micro-architecture to the runtime system to address these bottlenecks and build more balanced computing systems.

Time  Label  Presentation Title
Authors
11:00  6.1.1  NTX: AN ENERGY-EFFICIENT STREAMING ACCELERATOR FOR FLOATING-POINT GENERALIZED REDUCTION WORKLOADS IN 22NM FD-SOI
Speaker:
Luca Benini, IIS, ETH Zürich, CH
Authors:
Fabian Schuiki, Michael Schaffner and Luca Benini, IIS, ETH Zürich, CH
Abstract
Specialized coprocessors for Multiply-Accumulate (MAC) intensive workloads such as Deep Learning are becoming widespread in SoC platforms, from GPUs to mobile SoCs. In this paper we revisit NTX (an efficient accelerator developed for training Deep Neural Networks at scale) as a generalized MAC and reduction streaming engine. The architecture consists of a set of 32-bit floating-point streaming co-processors that are loosely coupled to a RISC-V core in charge of orchestrating data movement and computation. Post-layout results of a recent silicon implementation in 22 nm FD-SOI technology show the accelerator's capability to deliver up to 20 Gflop/s at 1.25 GHz and 168 mW. Based on these results we show that a version of NTX scaled down to 14 nm can achieve a 3× energy efficiency improvement over contemporary GPUs at 10.4× less silicon area, and a compute performance of 1.4 Tflop/s for training large state-of-the-art networks with full floating-point precision. An extended evaluation of MAC-intensive kernels shows that NTX can consistently achieve up to 87% of its peak performance across general reduction workloads beyond machine learning. Its modular architecture enables deployment at different scales ranging from high-performance GPU-class to low-power embedded scenarios.
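
Editorial note: the headline figures in the abstract above can be combined with simple arithmetic. The sketch below uses only the numbers quoted in the abstract (20 Gflop/s, 1.25 GHz, 168 mW, 87%); the function names are illustrative, and the derived values are back-of-the-envelope estimates rather than results reported by the authors.

```python
# Figures quoted in the abstract (post-layout, 22 nm FD-SOI).
PEAK_GFLOPS = 20.0    # "up to 20 Gflop/s"
CLOCK_GHZ = 1.25      # "at 1.25 GHz"
POWER_W = 0.168       # "and 168 mW"
PEAK_FRACTION = 0.87  # "up to 87% of its peak performance"


def energy_efficiency(gflops: float, watts: float) -> float:
    """Return energy efficiency in Gflop/s per watt."""
    return gflops / watts


def flops_per_cycle(gflops: float, clock_ghz: float) -> float:
    """Return floating-point operations completed per clock cycle."""
    return gflops / clock_ghz


def sustained_gflops(peak_gflops: float, fraction: float) -> float:
    """Return sustained throughput given a fraction of peak."""
    return peak_gflops * fraction


if __name__ == "__main__":
    eff = energy_efficiency(PEAK_GFLOPS, POWER_W)
    per_cycle = flops_per_cycle(PEAK_GFLOPS, CLOCK_GHZ)
    sustained = sustained_gflops(PEAK_GFLOPS, PEAK_FRACTION)
    print(f"Energy efficiency: {eff:.1f} Gflop/s/W")
    print(f"Operations per cycle: {per_cycle:.1f} flop/cycle")
    print(f"Sustained at 87% of peak: {sustained:.1f} Gflop/s")
```

If one MAC is counted as two floating-point operations (a common convention, assumed here and not stated in the abstract), the per-cycle figure corresponds to eight MACs per cycle.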

11:30  6.1.2  NEAR-MEMORY PROCESSING: IT'S THE HARDWARE AND SOFTWARE, SILLY!
Speaker and Author:
Boris Grot, University of Edinburgh, GB
Abstract
Conventional computing systems are increasingly challenged by the need to process rapidly growing volumes of data, often at online speeds. One promising way to boost compute efficiency is through Near-Memory Processing (NMP), which integrates light-weight compute logic close to the memory arrays. NMP affords massive bandwidth to the memory-resident data and dramatically reduces energy-hungry data movement. A key challenge for effectively leveraging NMP is that today's high-performance data processing algorithms have been designed for CPUs with powerful cores, large caches, and bandwidth-constrained memory interfaces. Meanwhile, NMP architectures are limited to simple logic and small caches while offering abundant memory bandwidth. Hence, achieving high efficiency with NMP requires a careful algorithm-hardware co-design to maximize bandwidth utilization given a highly constrained area and power budget. I will describe one instance of such a co-designed NMP architecture for data analytics, and show that it reaps significant performance and energy-efficiency advantages over both CPU-based and baseline NMP systems.
12:00  6.1.3  COHERENTLY ATTACHED PROGRAMMABLE NEAR-MEMORY ACCELERATION PLATFORM AND ITS APPLICATION TO STENCIL PROCESSING
Speaker:
Jan van Lunteren, IBM Research Zurich, CH
Authors:
Jan van Lunteren, Ronald Luijten, Dionysios Diamantopoulos, Florian Auernhammer, Christoph Hagleitner, Lorenzo Chelini, Stefano Corda and Gagandeep Singh, IBM Research Zurich, CH
Abstract
Application and technology trends are increasingly forcing computer systems to be designed for specific workloads and application domains. Although memory is one of the key components impacting the performance and power consumption of state-of-the-art computer systems, its operation typically cannot be adapted to workload characteristics beyond some limited controller configuration options. In this paper, we present a novel near-memory acceleration platform based on an Access Processor that enables the main memory system operation to be programmed and adapted dynamically to the accelerated workload. The platform targets both ASIC and FPGA implementations integrated within IBM POWER systems. We show how this platform can be applied to accelerate stencil processing.
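
Editorial note: for readers unfamiliar with the term, a stencil computation updates each grid point from a fixed pattern of neighbouring points. The sketch below is a generic 5-point Jacobi step in plain Python, given only to illustrate that access pattern; it is not the programming model of the platform described in the abstract, and the function name `jacobi_step` is hypothetical.

```python
from typing import List

Grid = List[List[float]]


def jacobi_step(grid: Grid) -> Grid:
    """Apply one 5-point Jacobi stencil sweep to the interior of a 2D grid.

    Boundary cells are copied unchanged. Each interior cell becomes the
    average of its four direct neighbours (north, south, west, east).
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # copy so reads always see the old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return out


if __name__ == "__main__":
    # Top boundary held at 4.0, everything else starts at 0.0.
    grid = [
        [4.0, 4.0, 4.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
    ]
    print(jacobi_step(grid)[1][1])
```

Each output value needs four loads and only a handful of arithmetic operations, which is why stencil kernels are typically limited by memory bandwidth rather than by compute.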
12:30  End of session
Lunch Break in Lunch Area


