5.7 Data-driven Acceleration

Time	Label	Presentation Title Authors
08:30	5.7.1	A COMPILER FOR AUTOMATIC SELECTION OF SUITABLE PROCESSING-IN-MEMORY INSTRUCTIONS Speaker: Luigi Carro, UFRGS - Federal University of Rio Grande do Sul, BR Authors: hameeza ahmed¹, Paulo Cesar Santos², Joao Paulo Lima², Rafael F. de Moura², Marco Antonio Zanata Alves³, Antonio Carlos Schneider Beck² and Luigi Carro² ¹NED University of Engineering and Technology, PK; ²UFRGS - Universidade Federal do Rio Grande do Sul, BR; ³UFPR, BR Abstract Although not a new technique, due to the advent of 3D-stacked technologies, the integration of large memories and logic circuitry able to compute large amount of data has revived the Processing-in-Memory (PIM) techniques. PIM is a technique to increase performance while reducing energy consumption when dealing with large amounts of data. Despite several designs of PIM are available in the literature, their effective implementation still burdens the programmer. Also, various PIM instances are required to take advantage of the internal 3D-stacked memories, which further increases the challenges faced by the programmers. In this way, this work presents the Processing-In-Memory cOmpiler (PRIMO). Our compiler is able to efficiently exploit large vector units on a PIM architecture, directly from the original code. PRIMO is able to automatically select suitable PIM operations, allowing its automatic offloading. Moreover, PRIMO concerns about several PIM instances, selecting the most suitable instance while reduces internal communication between different PIM units. The compilation results of different benchmarks depict how PRIMO is able to exploit large vectors, while achieving a near-optimal performance when compared to the ideal execution for the case study PIM. PRIMO allows a speedup of 38x for specific kernels, while on average achieves 11.8x for a set of benchmarks from PolyBench Suite. Download Paper (PDF; Only available from the DATE venue WiFi)
09:00	5.7.2	CACHE-AWARE KERNEL TILING: AN APPROACH FOR SYSTEM-LEVEL PERFORMANCE OPTIMIZATION OF GPU-BASED APPLICATIONS Speaker: ARIAN MAGHAZEH, Linköping University, SE Authors: Arian Maghazeh¹, Sudipta Chattopadhyay², Petru Eles³ and Zebo Peng¹ ¹Linköping University, SE; ²Singapore University of Technology and Design (SUTD), SG; ³Linkoping University, SE Abstract We present a software approach to address the data latency issue for certain GPU applications. Each application is modeled as a kernel graph, where the nodes represent individual GPU kernels and the edges capture data dependencies. Our technique exploits the GPU L2 cache to accelerate parameter passing between the kernels. The key idea is that, instead of having each kernel process the entire input in one invocation, we subdivide the input into fragments (which fit in the cache) and, ideally, process each fragment in one continuous sequence of kernel invocations. Our proposed technique is oblivious to kernel functionalities and requires minimal source code modification. We demonstrate our technique on a full-fledged image processing application and improve the performance on average by 30% over various settings. Download Paper (PDF; Only available from the DATE venue WiFi)
09:30	5.7.3	(Best Paper Award Candidate) DATA SUBSETTING: A DATA-CENTRIC APPROACH TO APPROXIMATE COMPUTING Speaker: Younghoon Kim, Purdue University, KR Authors: Younghoon Kim¹, Swagath Venkataramani², Nitin Chandrachoodan³ and Anand Raghunathan¹ ¹Purdue University, US; ²IBM T. J. Watson Research Center, US; ³Indian Institute of Technology Madras, IN Abstract Approximate Computing (AxC), which leverages the intrinsic resilience of applications to approximations in their underlying computations, has emerged as a promising approach to improving computing system efficiency. Most prior efforts in AxC take a compute-centric approach and approximate arithmetic or other compute operations through design techniques at different levels of abstraction. However, emerging workloads such as machine learning, search and data analytics process large amounts of data and are significantly limited by the memory sub-systems of modern computing platforms. In this work, we shift the focus of approximations from computations to data, and propose a data-centric approach to AxC, which can boost the performance of memory-subsystem-limited applications. The key idea is to modulate the application's data-accesses in a manner that reduces off-chip memory traffic. Specifically, we propose a data-access approximation technique called data subsetting, in which all accesses to a data structure are redirected to a subset of its elements so that the overall footprint of memory accesses is decreased. We realize data subsetting in a manner that is transparent to hardware and requires only minimal changes to application software. Recognizing that most applications of interest represent and process data as multi-dimensional arrays or tensors, we develop a templated data structure called SubsettableTensor that embodies mechanisms to define the accessible subset and to suitably redirect accesses to elements outside the subset. As a further optimization, we observe that data subsetting may cause some computations to become redundant and propose a mechanism for application software to identify and eliminate such computations. We implement SubsettableTensor as a C++ class and evaluate it using parallel software implementations of 7 machine learning applications on a 48-core AMD Opteron server. Our experiments indicate that data subsetting enables 1.33x to 4.44x performance improvement with <0.5% loss in application-level quality, underscoring its promise as a new approach to approximate computing. Download Paper (PDF; Only available from the DATE venue WiFi)
10:00	IP2-18, 673	TAMING DATA CACHES FOR PREDICTABLE EXECUTION ON GPU-BASED SOCS Speaker: Björn Forsberg, ETH Zürich, CH Authors: Björn Forsberg¹, Luca Benini² and Andrea Marongiu³ ¹ETH Zürich, CH; ²Università di Bologna, IT; ³University of Bologna, IT Abstract Heterogeneous SoCs (HeSoCs) typically share a single DRAM between the CPU and GPU, making workloads susceptible to memory interference, and predictable execution troublesome. State-of-the art predictable execution models (PREM) for HeSoCs prefetch data to the GPU scratchpad memory (SPM), for computations to be insensitive to CPU-generated DRAM traffic. However, the amount of work that the small SPM sizes allow is typically insufficient to absorb CPU/GPU synchronization costs. On-chip caches are larger, and would solve this issue, but have been argued too unpredictable due to self-evictions. We show how self-eviction can be minimized in GPU caches via clever managing of prefetches, thus lowering the performance cost, while retaining timing predictability. Download Paper (PDF; Only available from the DATE venue WiFi)
10:01	IP2-19, 739	DESIGN AND EVALUATION OF SMALLFLOAT SIMD EXTENSIONS TO THE RISC-V ISA Speaker: Giuseppe Tagliavini, University of Bologna, IT Authors: Giuseppe Tagliavini¹, Stefan Mach², Davide Rossi³, Andrea Marongiu¹ and Luca Benini⁴ ¹University of Bologna, IT; ²ETH Zurich, CH; ³University Of Bologna, IT; ⁴Università di Bologna, IT Abstract RISC-V is an open-source instruction set architecture (ISA) with a modular design consisting of a mandatory base part plus optional extensions. The RISC-V 32IMFC ISA configuration has been widely adopted for the design of new-generation, low-power processors. Motivated by the important energy savings that smaller-than-32-bit FP types have enabled in several application domains and related compute platforms, some recent studies have published encouraging early results for their adoption in RISC-V processors. In this paper we introduce a set of ISA extensions for RISC-V 32IMFC, supporting scalar and SIMD operations (fitting the 32-bit register size) for 8-bit and two 16-bit FP types. The proposed extensions are enabled by exposing the new FP types to the standard C/C++ type system and an implementation for the RISC-V GCC compiler is presented. As a further, novel contribution, we extensively characterize the performance and energy savings achievable with the proposed extensions. On average, experimental results show that their adoption provide benefits in terms of performance (1.64x speedup for 16-bit and 2.18x for 8-bit types) and energy consumption (30% saving for 16-bit and 50% for 8-bit types). We also illustrate an approach based on automatic precision tuning to make effective use of the new FP types. Download Paper (PDF; Only available from the DATE venue WiFi)
10:02	IP2-20, 24	VDARM: DYNAMIC ADAPTIVE RESOURCE MANAGEMENT FOR VIRTUALIZED MULTIPROCESSOR SYSTEMS Speaker: Jianmin Qian, Shanghai Jiao Tong University, CN Authors: Jianmin Qian, Jian Li, Ruhui Ma and Haibing Guan, Shanghai Jiao Tong University, CN Abstract Modern data center servers have been enhancing their computing capacity by increasing processor counts. Meanwhile, these servers are highly virtualized to achieve efficient resource utilization and energy savings. However, due to the shifting of server architecture to non-uniform memory access (NUMA), current hypervisor-level or OS-level resource management methods continue to be challenged in their ability to meet the performance requirement of various user applications. In this work, we first build a performance slowdown model to accurate identify the current system overheads. Based on the model, we finally design a dynamic adaptive virtual resource management method (vDARM) to eliminate the runtime NUMA overheads by re-configuring virtual-to-physical resource mappings. Experiment results show that, compared with state-of-art approaches, vDARM can bring up an average performance improvement of 36.2% on a 8-node NUMA machines. Meanwhile, vDARM only incurs extra CPU utilization no more than 4%. Download Paper (PDF; Only available from the DATE venue WiFi)
10:00		End of session Coffee Break in Exhibition Area Coffee Breaks in the Exhibition Area On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area. Lunch Breaks (Lunch Area) On all conference days (Tuesday to Thursday), a seated lunch (lunch buffet) will be offered in the Lunch Area to fully registered conference delegates only. There will be badge control at the entrance to the lunch break area. Tuesday, March 26, 2019 Coffee Break 10:30 - 11:30 Lunch Break 13:00 - 14:30 Keynote Lecture "Leonardo da Vinci, Humanism and Engineering between Florence and Milan" by Claudio Giorgione in room 1 13:50 - 14:20 Coffee Break 16:00 - 17:00 Wednesday, March 27, 2019 Coffee Break 10:00 - 11:00 Lunch Break 12:30 - 14:30 Keynote Lecture "Heterogeneous, High Scale Computing in the Era of Intelligent, Cloud-Connected" by David Pellerin, Amazon, US in room 1 13:50 - 14:20 Coffee Break 16:00 - 17:00 Thursday, March 28, 2019 Coffee Break 10:00 - 11:00 University Booth Best Demo Award Presentation at the University Booth 10:30 Lunch Break 12:30 - 14:00 Keynote Lecture "A Fundamental Look at Models and Intelligence" by Edward A. Lee, University of California, Berkeley, US in room 1 13:20 - 13:50 Coffee Break 15:30 - 16:00

Time

Label

Presentation Title
Authors

08:30

5.7.1

A COMPILER FOR AUTOMATIC SELECTION OF SUITABLE PROCESSING-IN-MEMORY INSTRUCTIONS
Speaker:
Luigi Carro, UFRGS - Federal University of Rio Grande do Sul, BR
Authors:
hameeza ahmed¹, Paulo Cesar Santos², Joao Paulo Lima², Rafael F. de Moura², Marco Antonio Zanata Alves³, Antonio Carlos Schneider Beck² and Luigi Carro²
¹NED University of Engineering and Technology, PK; ²UFRGS - Universidade Federal do Rio Grande do Sul, BR; ³UFPR, BR
Abstract
Although not a new technique, due to the advent of 3D-stacked technologies, the integration of large memories and logic circuitry able to compute large amount of data has revived the Processing-in-Memory (PIM) techniques. PIM is a technique to increase performance while reducing energy consumption when dealing with large amounts of data. Despite several designs of PIM are available in the literature, their effective implementation still burdens the programmer. Also, various PIM instances are required to take advantage of the internal 3D-stacked memories, which further increases the challenges faced by the programmers. In this way, this work presents the Processing-In-Memory cOmpiler (PRIMO). Our compiler is able to efficiently exploit large vector units on a PIM architecture, directly from the original code. PRIMO is able to automatically select suitable PIM operations, allowing its automatic offloading. Moreover, PRIMO concerns about several PIM instances, selecting the most suitable instance while reduces internal communication between different PIM units. The compilation results of different benchmarks depict how PRIMO is able to exploit large vectors, while achieving a near-optimal performance when compared to the ideal execution for the case study PIM. PRIMO allows a speedup of 38x for specific kernels, while on average achieves 11.8x for a set of benchmarks from PolyBench Suite.
Download Paper (PDF; Only available from the DATE venue WiFi)

09:00

5.7.2

CACHE-AWARE KERNEL TILING: AN APPROACH FOR SYSTEM-LEVEL PERFORMANCE OPTIMIZATION OF GPU-BASED APPLICATIONS
Speaker:
ARIAN MAGHAZEH, Linköping University, SE
Authors:
Arian Maghazeh¹, Sudipta Chattopadhyay², Petru Eles³ and Zebo Peng¹
¹Linköping University, SE; ²Singapore University of Technology and Design (SUTD), SG; ³Linkoping University, SE
Abstract
We present a software approach to address the data latency issue for certain GPU applications. Each application is modeled as a kernel graph, where the nodes represent individual GPU kernels and the edges capture data dependencies. Our technique exploits the GPU L2 cache to accelerate parameter passing between the kernels. The key idea is that, instead of having each kernel process the entire input in one invocation, we subdivide the input into fragments (which fit in the cache) and, ideally, process each fragment in one continuous sequence of kernel invocations. Our proposed technique is oblivious to kernel functionalities and requires minimal source code modification. We demonstrate our technique on a full-fledged image processing application and improve the performance on average by 30% over various settings.
Download Paper (PDF; Only available from the DATE venue WiFi)

09:30

5.7.3

(Best Paper Award Candidate)
DATA SUBSETTING: A DATA-CENTRIC APPROACH TO APPROXIMATE COMPUTING
Speaker:
Younghoon Kim, Purdue University, KR
Authors:
Younghoon Kim¹, Swagath Venkataramani², Nitin Chandrachoodan³ and Anand Raghunathan¹
¹Purdue University, US; ²IBM T. J. Watson Research Center, US; ³Indian Institute of Technology Madras, IN
Abstract
Approximate Computing (AxC), which leverages the intrinsic resilience of applications to approximations in their underlying computations, has emerged as a promising approach to improving computing system efficiency. Most prior efforts in AxC take a compute-centric approach and approximate arithmetic or other compute operations through design techniques at different levels of abstraction. However, emerging workloads such as machine learning, search and data analytics process large amounts of data and are significantly limited by the memory sub-systems of modern computing platforms. In this work, we shift the focus of approximations from computations to data, and propose a data-centric approach to AxC, which can boost the performance of memory-subsystem-limited applications. The key idea is to modulate the application's data-accesses in a manner that reduces off-chip memory traffic. Specifically, we propose a data-access approximation technique called data subsetting, in which all accesses to a data structure are redirected to a subset of its elements so that the overall footprint of memory accesses is decreased. We realize data subsetting in a manner that is transparent to hardware and requires only minimal changes to application software. Recognizing that most applications of interest represent and process data as multi-dimensional arrays or tensors, we develop a templated data structure called SubsettableTensor that embodies mechanisms to define the accessible subset and to suitably redirect accesses to elements outside the subset. As a further optimization, we observe that data subsetting may cause some computations to become redundant and propose a mechanism for application software to identify and eliminate such computations. We implement SubsettableTensor as a C++ class and evaluate it using parallel software implementations of 7 machine learning applications on a 48-core AMD Opteron server. Our experiments indicate that data subsetting enables 1.33x to 4.44x performance improvement with <0.5% loss in application-level quality, underscoring its promise as a new approach to approximate computing.
Download Paper (PDF; Only available from the DATE venue WiFi)

10:00

IP2-18, 673

TAMING DATA CACHES FOR PREDICTABLE EXECUTION ON GPU-BASED SOCS
Speaker:
Björn Forsberg, ETH Zürich, CH
Authors:
Björn Forsberg¹, Luca Benini² and Andrea Marongiu³
¹ETH Zürich, CH; ²Università di Bologna, IT; ³University of Bologna, IT
Abstract
Heterogeneous SoCs (HeSoCs) typically share a single DRAM between the CPU and GPU, making workloads susceptible to memory interference, and predictable execution troublesome. State-of-the art predictable execution models (PREM) for HeSoCs prefetch data to the GPU scratchpad memory (SPM), for computations to be insensitive to CPU-generated DRAM traffic. However, the amount of work that the small SPM sizes allow is typically insufficient to absorb CPU/GPU synchronization costs. On-chip caches are larger, and would solve this issue, but have been argued too unpredictable due to self-evictions. We show how self-eviction can be minimized in GPU caches via clever managing of prefetches, thus lowering the performance cost, while retaining timing predictability.
Download Paper (PDF; Only available from the DATE venue WiFi)

10:01

IP2-19, 739

DESIGN AND EVALUATION OF SMALLFLOAT SIMD EXTENSIONS TO THE RISC-V ISA
Speaker:
Giuseppe Tagliavini, University of Bologna, IT
Authors:
Giuseppe Tagliavini¹, Stefan Mach², Davide Rossi³, Andrea Marongiu¹ and Luca Benini⁴
¹University of Bologna, IT; ²ETH Zurich, CH; ³University Of Bologna, IT; ⁴Università di Bologna, IT
Abstract
RISC-V is an open-source instruction set architecture (ISA) with a modular design consisting of a mandatory base part plus optional extensions. The RISC-V 32IMFC ISA configuration has been widely adopted for the design of new-generation, low-power processors. Motivated by the important energy savings that smaller-than-32-bit FP types have enabled in several application domains and related compute platforms, some recent studies have published encouraging early results for their adoption in RISC-V processors. In this paper we introduce a set of ISA extensions for RISC-V 32IMFC, supporting scalar and SIMD operations (fitting the 32-bit register size) for 8-bit and two 16-bit FP types. The proposed extensions are enabled by exposing the new FP types to the standard C/C++ type system and an implementation for the RISC-V GCC compiler is presented. As a further, novel contribution, we extensively characterize the performance and energy savings achievable with the proposed extensions. On average, experimental results show that their adoption provide benefits in terms of performance (1.64x speedup for 16-bit and 2.18x for 8-bit types) and energy consumption (30% saving for 16-bit and 50% for 8-bit types). We also illustrate an approach based on automatic precision tuning to make effective use of the new FP types.
Download Paper (PDF; Only available from the DATE venue WiFi)

10:02

IP2-20, 24

VDARM: DYNAMIC ADAPTIVE RESOURCE MANAGEMENT FOR VIRTUALIZED MULTIPROCESSOR SYSTEMS
Speaker:
Jianmin Qian, Shanghai Jiao Tong University, CN
Authors:
Jianmin Qian, Jian Li, Ruhui Ma and Haibing Guan, Shanghai Jiao Tong University, CN
Abstract
Modern data center servers have been enhancing their computing capacity by increasing processor counts. Meanwhile, these servers are highly virtualized to achieve efficient resource utilization and energy savings. However, due to the shifting of server architecture to non-uniform memory access (NUMA), current hypervisor-level or OS-level resource management methods continue to be challenged in their ability to meet the performance requirement of various user applications. In this work, we first build a performance slowdown model to accurate identify the current system overheads. Based on the model, we finally design a dynamic adaptive virtual resource management method (vDARM) to eliminate the runtime NUMA overheads by re-configuring virtual-to-physical resource mappings. Experiment results show that, compared with state-of-art approaches, vDARM can bring up an average performance improvement of 36.2% on a 8-node NUMA machines. Meanwhile, vDARM only incurs extra CPU utilization no more than 4%.
Download Paper (PDF; Only available from the DATE venue WiFi)

10:00

End of session
Coffee Break in Exhibition Area

Coffee Breaks in the Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Lunch Breaks (Lunch Area)

On all conference days (Tuesday to Thursday), a seated lunch (lunch buffet) will be offered in the Lunch Area to fully registered conference delegates only. There will be badge control at the entrance to the lunch break area.

Tuesday, March 26, 2019

Coffee Break 10:30 - 11:30
Lunch Break 13:00 - 14:30
Keynote Lecture "Leonardo da Vinci, Humanism and Engineering between Florence and Milan" by Claudio Giorgione in room 1 13:50 - 14:20
Coffee Break 16:00 - 17:00

Wednesday, March 27, 2019

Coffee Break 10:00 - 11:00
Lunch Break 12:30 - 14:30
Keynote Lecture "Heterogeneous, High Scale Computing in the Era of Intelligent, Cloud-Connected" by David Pellerin, Amazon, US in room 1 13:50 - 14:20
Coffee Break 16:00 - 17:00

Thursday, March 28, 2019

Coffee Break 10:00 - 11:00
University Booth Best Demo Award Presentation at the University Booth 10:30
Lunch Break 12:30 - 14:00
Keynote Lecture "A Fundamental Look at Models and Intelligence" by Edward A. Lee, University of California, Berkeley, US in room 1 13:20 - 13:50
Coffee Break 15:30 - 16:00