5.7 Data-driven Acceleration


Date: Wednesday 27 March 2019
Time: 08:30 - 10:00
Location / Room: Room 7

Chair:
Christian Fabre, CEA-Leti, FR

Co-Chair:
Borzoo Bonakdarpour, Iowa State University, US

This session presents accelerated computing paradigms guided by application-data criticality. The first paper presents a compiler for processing-in-memory (PIM) architectures. The second paper proposes a novel kernel tiling approach that exploits the GPU L2 cache. The third paper introduces data subsetting to reduce memory traffic for approximate computing platforms. The IPs cover RISC-V extensions for low-precision floating-point operations and predictable execution on GPU-based SoCs.

Time Label Presentation Title
Authors
08:30 5.7.1 A COMPILER FOR AUTOMATIC SELECTION OF SUITABLE PROCESSING-IN-MEMORY INSTRUCTIONS
Speaker:
Luigi Carro, UFRGS - Federal University of Rio Grande do Sul, BR
Authors:
Hameeza Ahmed1, Paulo Cesar Santos2, Joao Paulo Lima2, Rafael F. de Moura2, Marco Antonio Zanata Alves3, Antonio Carlos Schneider Beck2 and Luigi Carro2
1NED University of Engineering and Technology, PK; 2UFRGS - Universidade Federal do Rio Grande do Sul, BR; 3UFPR, BR
Abstract
Although not a new technique, Processing-in-Memory (PIM) has been revived by the advent of 3D-stacked technologies, which integrate large memories with logic circuitry able to compute on large amounts of data. PIM increases performance while reducing energy consumption when dealing with large amounts of data. Although several PIM designs are available in the literature, their effective use still burdens the programmer. Moreover, multiple PIM instances are required to take advantage of the internal 3D-stacked memories, which further increases the challenges programmers face. To address this, this work presents the Processing-In-Memory cOmpiler (PRIMO). Our compiler efficiently exploits large vector units on a PIM architecture, directly from the original code. PRIMO automatically selects suitable PIM operations, enabling automatic offloading. Moreover, PRIMO accounts for multiple PIM instances, selecting the most suitable instance while reducing internal communication between different PIM units. The compilation results for different benchmarks show how PRIMO exploits large vectors while achieving near-optimal performance compared to the ideal execution for the case-study PIM. PRIMO achieves a speedup of 38x for specific kernels and 11.8x on average for a set of benchmarks from the PolyBench suite.

09:00 5.7.2 CACHE-AWARE KERNEL TILING: AN APPROACH FOR SYSTEM-LEVEL PERFORMANCE OPTIMIZATION OF GPU-BASED APPLICATIONS
Speaker:
Arian Maghazeh, Linköping University, SE
Authors:
Arian Maghazeh1, Sudipta Chattopadhyay2, Petru Eles3 and Zebo Peng1
1Linköping University, SE; 2Singapore University of Technology and Design (SUTD), SG; 3Linköping University, SE
Abstract
We present a software approach to address the data latency issue for certain GPU applications. Each application is modeled as a kernel graph, where the nodes represent individual GPU kernels and the edges capture data dependencies. Our technique exploits the GPU L2 cache to accelerate parameter passing between the kernels. The key idea is that, instead of having each kernel process the entire input in one invocation, we subdivide the input into fragments (which fit in the cache) and, ideally, process each fragment in one continuous sequence of kernel invocations. Our proposed technique is oblivious to kernel functionalities and requires minimal source code modification. We demonstrate our technique on a full-fledged image processing application and improve the performance on average by 30% over various settings.
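The fragment-by-fragment execution order described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a simple linear chain of element-wise kernels (the paper models general kernel graphs on a GPU), and the names `run_untiled`, `run_tiled`, `fragment_size` and the three toy kernels are hypothetical. The point is only that running the whole kernel chain on one cache-sized fragment at a time yields the same result as running each kernel over the entire input.

```python
from typing import Callable, List, Sequence

# A "kernel" here is just a function over a list of values (an assumption
# standing in for a GPU kernel launch).
Kernel = Callable[[List[float]], List[float]]


def run_untiled(kernels: Sequence[Kernel], data: List[float]) -> List[float]:
    """Baseline order: every kernel processes the entire input in one call."""
    for kernel in kernels:
        data = kernel(data)
    return data


def run_tiled(kernels: Sequence[Kernel], data: List[float],
              fragment_size: int) -> List[float]:
    """Tiled order: push one fragment through the whole chain, then the next.

    fragment_size is a hypothetical tuning knob; in the setting of the
    abstract it would be chosen so that a fragment fits in the L2 cache.
    """
    if fragment_size < 1:
        raise ValueError("fragment_size must be positive")
    result: List[float] = []
    for start in range(0, len(data), fragment_size):
        fragment = data[start:start + fragment_size]
        for kernel in kernels:  # one continuous sequence of kernel invocations
            fragment = kernel(fragment)
        result.extend(fragment)
    return result


# Toy element-wise kernels (hypothetical, for illustration only).
def scale(xs: List[float]) -> List[float]:
    return [2.0 * x for x in xs]


def offset(xs: List[float]) -> List[float]:
    return [x + 1.0 for x in xs]


def clamp(xs: List[float]) -> List[float]:
    return [min(x, 10.0) for x in xs]
```

The equivalence of the two orders holds here because each toy kernel is element-wise; kernels that read neighbouring elements would need overlapping fragments, which this sketch does not model.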

09:30 5.7.3 (Best Paper Award Candidate)
DATA SUBSETTING: A DATA-CENTRIC APPROACH TO APPROXIMATE COMPUTING
Speaker:
Younghoon Kim, Purdue University, US
Authors:
Younghoon Kim1, Swagath Venkataramani2, Nitin Chandrachoodan3 and Anand Raghunathan1
1Purdue University, US; 2IBM T. J. Watson Research Center, US; 3Indian Institute of Technology Madras, IN
Abstract
Approximate Computing (AxC), which leverages the intrinsic resilience of applications to approximations in their underlying computations, has emerged as a promising approach to improving computing system efficiency. Most prior efforts in AxC take a compute-centric approach and approximate arithmetic or other compute operations through design techniques at different levels of abstraction. However, emerging workloads such as machine learning, search and data analytics process large amounts of data and are significantly limited by the memory sub-systems of modern computing platforms. In this work, we shift the focus of approximations from computations to data, and propose a data-centric approach to AxC, which can boost the performance of memory-subsystem-limited applications. The key idea is to modulate the application's data-accesses in a manner that reduces off-chip memory traffic. Specifically, we propose a data-access approximation technique called data subsetting, in which all accesses to a data structure are redirected to a subset of its elements so that the overall footprint of memory accesses is decreased. We realize data subsetting in a manner that is transparent to hardware and requires only minimal changes to application software. Recognizing that most applications of interest represent and process data as multi-dimensional arrays or tensors, we develop a templated data structure called SubsettableTensor that embodies mechanisms to define the accessible subset and to suitably redirect accesses to elements outside the subset. As a further optimization, we observe that data subsetting may cause some computations to become redundant and propose a mechanism for application software to identify and eliminate such computations. We implement SubsettableTensor as a C++ class and evaluate it using parallel software implementations of 7 machine learning applications on a 48-core AMD Opteron server. 
Our experiments indicate that data subsetting enables 1.33x to 4.44x performance improvement with <0.5% loss in application-level quality, underscoring its promise as a new approach to approximate computing.
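The access-redirection idea behind data subsetting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's SubsettableTensor (which is a C++ class for multi-dimensional data): the class name `SubsettedArray`, the one-dimensional layout, and the strided redirection policy (every access is served by the nearest retained element at or below its index) are all hypothetical choices made for this example.

```python
from typing import List, Sequence


class SubsettedArray:
    """1-D array that stores only every `stride`-th element.

    Accesses to any index are redirected to the retained element at
    index (i // stride) * stride. This policy is an assumption made for
    illustration; the abstract does not specify the redirection rule.
    """

    def __init__(self, values: Sequence[float], stride: int) -> None:
        if stride < 1:
            raise ValueError("stride must be positive")
        self._length = len(values)
        self._stride = stride
        # Only the subset is kept, so the memory footprint shrinks.
        self._kept: List[float] = list(values[::stride])

    def __len__(self) -> int:
        # The logical length is unchanged, so calling code needs no edits.
        return self._length

    def __getitem__(self, index: int) -> float:
        if not 0 <= index < self._length:
            raise IndexError(index)
        return self._kept[index // self._stride]

    def footprint(self) -> int:
        """Number of elements actually stored."""
        return len(self._kept)


def mean(array: SubsettedArray) -> float:
    """Example consumer: reads through the normal indexing interface."""
    return sum(array[i] for i in range(len(array))) / len(array)
```

With `stride=1` the wrapper behaves exactly like the original array; larger strides trade accuracy for a smaller footprint, which is the knob the abstract describes at a high level.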

10:00 IP2-18, 673 TAMING DATA CACHES FOR PREDICTABLE EXECUTION ON GPU-BASED SOCS
Speaker:
Björn Forsberg, ETH Zürich, CH
Authors:
Björn Forsberg1, Luca Benini2 and Andrea Marongiu3
1ETH Zürich, CH; 2Università di Bologna, IT; 3University of Bologna, IT
Abstract
Heterogeneous SoCs (HeSoCs) typically share a single DRAM between the CPU and GPU, making workloads susceptible to memory interference, and predictable execution troublesome. State-of-the-art predictable execution models (PREM) for HeSoCs prefetch data to the GPU scratchpad memory (SPM), for computations to be insensitive to CPU-generated DRAM traffic. However, the amount of work that the small SPM sizes allow is typically insufficient to absorb CPU/GPU synchronization costs. On-chip caches are larger, and would solve this issue, but have been argued to be too unpredictable due to self-evictions. We show how self-eviction can be minimized in GPU caches via careful management of prefetches, thus lowering the performance cost, while retaining timing predictability.

10:01 IP2-19, 739 DESIGN AND EVALUATION OF SMALLFLOAT SIMD EXTENSIONS TO THE RISC-V ISA
Speaker:
Giuseppe Tagliavini, University of Bologna, IT
Authors:
Giuseppe Tagliavini1, Stefan Mach2, Davide Rossi3, Andrea Marongiu1 and Luca Benini4
1University of Bologna, IT; 2ETH Zurich, CH; 3University of Bologna, IT; 4Università di Bologna, IT
Abstract
RISC-V is an open-source instruction set architecture (ISA) with a modular design consisting of a mandatory base part plus optional extensions. The RISC-V 32IMFC ISA configuration has been widely adopted for the design of new-generation, low-power processors. Motivated by the important energy savings that smaller-than-32-bit FP types have enabled in several application domains and related compute platforms, some recent studies have published encouraging early results for their adoption in RISC-V processors. In this paper we introduce a set of ISA extensions for RISC-V 32IMFC, supporting scalar and SIMD operations (fitting the 32-bit register size) for 8-bit and two 16-bit FP types. The proposed extensions are enabled by exposing the new FP types to the standard C/C++ type system, and an implementation for the RISC-V GCC compiler is presented. As a further, novel contribution, we extensively characterize the performance and energy savings achievable with the proposed extensions. On average, experimental results show that their adoption provides benefits in terms of performance (1.64x speedup for 16-bit and 2.18x for 8-bit types) and energy consumption (30% saving for 16-bit and 50% for 8-bit types). We also illustrate an approach based on automatic precision tuning to make effective use of the new FP types.
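To make the phrase "SIMD operations fitting the 32-bit register size" concrete, the following Python sketch packs two 16-bit floating-point values into one 32-bit word and emulates a lane-wise addition. It is only an emulation under assumptions: it uses the IEEE 754 binary16 encoding provided by Python's `struct` format code `e`, which may or may not match the 16-bit types proposed in the paper, and the function names `pack2_f16`, `unpack2_f16` and `vadd_f16` are hypothetical, not instructions from the extension.

```python
import struct
from typing import Tuple


def pack2_f16(lo: float, hi: float) -> int:
    """Pack two values as IEEE binary16 lanes of one 32-bit word.

    `lo` occupies bits 0-15 and `hi` bits 16-31 (little-endian lane order,
    an assumption made for this sketch).
    """
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def unpack2_f16(word: int) -> Tuple[float, float]:
    """Split a 32-bit word back into its two binary16 lanes."""
    return struct.unpack("<2e", struct.pack("<I", word))


def vadd_f16(a: int, b: int) -> int:
    """Emulate a two-lane 16-bit FP add on packed 32-bit operands.

    Each lane is added independently and the result is rounded back to
    binary16 when it is re-packed.
    """
    a_lo, a_hi = unpack2_f16(a)
    b_lo, b_hi = unpack2_f16(b)
    return pack2_f16(a_lo + b_lo, a_hi + b_hi)
```

One packed operation here processes two values per 32-bit register, which is the source of the throughput gain the abstract reports for 16-bit types; four 8-bit lanes would follow the same pattern but have no `struct` format code, so they are omitted.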

10:02 IP2-20, 24 VDARM: DYNAMIC ADAPTIVE RESOURCE MANAGEMENT FOR VIRTUALIZED MULTIPROCESSOR SYSTEMS
Speaker:
Jianmin Qian, Shanghai Jiao Tong University, CN
Authors:
Jianmin Qian, Jian Li, Ruhui Ma and Haibing Guan, Shanghai Jiao Tong University, CN
Abstract
Modern data center servers have been enhancing their computing capacity by increasing processor counts. Meanwhile, these servers are highly virtualized to achieve efficient resource utilization and energy savings. However, due to the shift of server architectures to non-uniform memory access (NUMA), current hypervisor-level or OS-level resource management methods continue to be challenged in their ability to meet the performance requirements of various user applications. In this work, we first build a performance slowdown model to accurately identify the current system overheads. Based on the model, we then design a dynamic adaptive virtual resource management method (vDARM) to eliminate the runtime NUMA overheads by re-configuring virtual-to-physical resource mappings. Experimental results show that, compared with state-of-the-art approaches, vDARM achieves an average performance improvement of 36.2% on an 8-node NUMA machine. Meanwhile, vDARM incurs no more than 4% extra CPU utilization.

10:00 End of session
Coffee Break in Exhibition Area


