2.5 GPU and GPU-based heterogeneous system management


Date: Tuesday 20 March 2018
Time: 11:30 - 13:00
Location / Room: Konf. 3

Chair:
Andrea Marongiu, Università di Bologna, IT

Co-Chair:
Carles Hernandez, BSC, ES

GPUs are at the heart of many modern heterogeneous systems, where the common communication paradigm between the CPU and the GPU is shared memory. The papers in this session propose novel techniques to deal with (i) efficient shared-memory management and (ii) GPU multiprocessor scheduling in the presence of process variation. The first paper focuses on GPU-based heterogeneous systems with a shared last-level cache (SLLC) and proposes a novel metric that combines CPU/GPU miss counts and "hit utility" to devise an effective cache-way partitioning. The second paper proposes a technique to mitigate the negative effects of cache and memory-controller sharing in GPUs running multiple workloads. The third paper discusses a hardware technique to mitigate the effects of hardware variability (e.g., process variations (PVs) and negative bias temperature instability (NBTI)) in GPU Streaming Processors (SPs).

Time   Label   Presentation Title / Authors
11:30   2.5.1   (Best Paper Award Candidate)
HVSM: HARDWARE-VARIABILITY AWARE STREAMING PROCESSORS' MANAGEMENT POLICY IN GPUS
Speaker:
Jingweijia Tan, Jilin University, CN
Authors:
Jingweijia Tan¹ and Kaige Yan²
¹Jilin University, CN; ²College of Communication Engineering, Jilin University, CN
Abstract
GPUs are widely used in general-purpose high-performance computing due to their highly parallel architecture. As integrated-circuit manufacturing has entered the nanometer-scale era, GPUs' computation capability has grown even stronger. However, as process technology scales down, hardware variability, e.g., process variations (PVs) and negative bias temperature instability (NBTI), has a higher impact on chip quality. The parallelism of GPUs demands high consistency among hardware units on the chip; otherwise, the worst unit inevitably becomes the bottleneck. Hardware variability is therefore a pressing concern for further improving GPUs' performance and lifetime, not only in integrated-circuit fabrication but even more in GPU architecture design. Streaming Processors (SPs) are the key units in GPUs, performing most parallel computing operations. In this work, we therefore focus on mitigating the impact of hardware variability on GPU SPs. We first model and analyze SPs' performance variations under hardware variability and observe that both PV and NBTI have a large impact on SP performance. We further observe unbalanced SP utilization during program execution, e.g., some SPs are idle while others are active. Leveraging both observations, we propose a Hardware Variability-aware SPs' Management policy (HVSM), which dynamically prioritizes the fast SPs, regroups SPs at a two-level granularity, and dispatches computation to appropriate SPs. Our experimental results show that HVSM effectively reduces the impact of hardware variability, which translates into a 28% performance improvement or a 14.4% lifetime extension for a GPU chip.
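
As a rough software illustration of the dispatch idea sketched in the abstract, consider the following minimal Python sketch; the speed values, grouping heuristic, and function names are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch of a variability-aware dispatch policy in the
# spirit of HVSM: prefer fast SPs, fall back to slow ones only when
# the fast ones are occupied. Speed levels are assumed inputs.

def dispatch(warps, sp_speed, busy):
    """Assign each ready warp to the fastest idle SP.

    warps:    list of warp ids waiting to issue
    sp_speed: dict sp_id -> measured relative speed (higher = faster)
    busy:     set of sp_ids currently occupied
    """
    # Sorting fast-first is a simplification standing in for the
    # paper's two-level regrouping of SPs.
    idle = [sp for sp in sp_speed if sp not in busy]
    idle.sort(key=lambda sp: sp_speed[sp], reverse=True)  # fast first

    assignment = {}
    for warp, sp in zip(warps, idle):
        assignment[warp] = sp
        busy.add(sp)
    return assignment

# Example: SP 2 suffers from PV/NBTI degradation, so it is used last.
speeds = {0: 1.00, 1: 0.97, 2: 0.80, 3: 0.95}
print(dispatch(["w0", "w1", "w2"], speeds, busy=set()))
# -> {'w0': 0, 'w1': 1, 'w2': 3}
```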

12:00   2.5.2   THROUGHPUT OPTIMIZATION AND RESOURCE ALLOCATION ON GPUS UNDER MULTI-APPLICATION EXECUTION
Speaker:
Iraklis Anagnostopoulos, Southern Illinois University Carbondale, US
Authors:
Srinivasa Reddy Punyala, Theodoros Marinakis, Arash Komaee and Iraklis Anagnostopoulos, Southern Illinois University Carbondale, US
Abstract
Platform heterogeneity prevails as a solution to the throughput and computational challenges imposed by parallel applications and technology scaling. In particular, Graphics Processing Units (GPUs) are based on the Single Instruction Multiple Thread (SIMT) paradigm and can offer tremendous speed-ups for parallel applications. However, GPUs were designed to execute a single application at a time. Under simultaneous multi-application execution, due to GPUs' massive multi-threading, applications compete destructively for the shared resources (caches and memory controllers), resulting in significant throughput degradation. This paper presents a methodology for minimizing interference in shared resources and providing efficient concurrent execution of multiple applications on GPUs. In particular, the proposed methodology (i) performs application classification; (ii) analyzes the per-class interference; (iii) finds the best matching between classes; and (iv) employs an efficient resource allocation. Experimental results show that the proposed approach increases system throughput for two concurrent applications by an average of 36% compared to the default execution and 10% compared to an exhaustive profile-based optimization technique.
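
As a rough illustration of steps (iii) and (iv), the sketch below pairs pending applications by profiled class-level interference; the class names, scores, and API are made-up assumptions, not the paper's methodology:

```python
# Hypothetical sketch of class matching: given profiled pairwise
# interference between application classes, co-schedule the pair of
# pending applications with the lowest mutual interference.
import itertools

# Assumed per-class-pair interference scores (lower = better pairing).
interference = {
    ("compute", "compute"): 0.9,
    ("compute", "memory"):  0.3,
    ("memory",  "memory"):  0.8,
}

def score(a, b):
    # Look up the pair in either order.
    return interference.get((a, b)) or interference[(b, a)]

def best_pair(pending):
    """pending: list of (app_name, class_name); returns the
    lowest-interference pair of applications."""
    return min(itertools.combinations(pending, 2),
               key=lambda p: score(p[0][1], p[1][1]))

apps = [("fft", "compute"), ("bfs", "memory"), ("gemm", "compute")]
print(best_pair(apps))  # -> (('fft', 'compute'), ('bfs', 'memory'))
```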

12:30   2.5.3   SET VARIATION-AWARE SHARED LAST-LEVEL CACHE MANAGEMENT FOR CPU-GPU HETEROGENEOUS ARCHITECTURE
Speaker:
Xin Li, Shandong University, CN
Authors:
Zhaoying Li, Lei Ju, Hongjun Dai, Xin Li, Mengying Zhao and Zhiping Jia, Shandong University, CN
Abstract
Heterogeneous CPU-GPU multiprocessor systems-on-chip (HMPSoCs) have become a popular architecture choice for high-performance embedded systems, where shared last-level cache (LLC) management is a critical design consideration. We observe that, within a sampling period, the CPU and GPU may have distinct access behaviors across different LLC sets. In this work, we propose a lightweight and fine-grained cache management policy to cope with the CPU-GPU access-behavior variation among cache sets. In particular, CPU and GPU requests are prioritized differently in each LLC set during cache block insertion and promotion, based on the per-core utility behaviors and a per-set CPU-GPU miss counter. Experimental results show that our LLC management scheme outperforms the two state-of-the-art schemes TAP-RRIP and LSP by 12.6% and 10.01%, respectively.
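
A minimal sketch of how a per-set saturating miss counter might steer insertion priority, in the spirit of (but not identical to) the policy described above; the counter width, threshold, and names are illustrative assumptions:

```python
# Hypothetical sketch of a per-set CPU-vs-GPU miss counter steering
# block insertion priority in a shared LLC.

class SetPolicy:
    MAX = 31  # assumed 5-bit saturating counter per set

    def __init__(self):
        self.ctr = self.MAX // 2  # start neutral

    def on_miss(self, requester):
        # CPU misses push the counter up, GPU misses push it down.
        if requester == "cpu":
            self.ctr = min(self.ctr + 1, self.MAX)
        else:
            self.ctr = max(self.ctr - 1, 0)

    def insertion_priority(self, requester):
        """Return 'high' (insert near MRU) or 'low' (insert near LRU)."""
        cpu_suffering = self.ctr > self.MAX // 2
        if requester == "cpu":
            return "high" if cpu_suffering else "low"
        return "low" if cpu_suffering else "high"

s = SetPolicy()
for _ in range(10):
    s.on_miss("cpu")                # this set is CPU-miss dominated...
print(s.insertion_priority("cpu"))  # -> 'high': favour CPU blocks here
```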

13:00   IP1-4, 464   HPXA: A HIGHLY PARALLEL XML PARSER
Speaker:
Smruti Sarangi, IIT Delhi, IN
Authors:
Isaar Ahmad, Sanjog Patil and Smruti R. Sarangi, IIT Delhi, IN
Abstract
State-of-the-art XML parsing approaches read an XML file byte by byte and use complex finite state machines to process each byte. In this paper, we propose a new parser, HPXA, which reads and processes 16 bytes at a time. We designed most of the components ab initio to ensure that they can process multiple XML tokens and tags in parallel. We propose two basic elements: a sparse 1D array compactor, and a hardware unit called LTMAdder that takes its decisions by adding the rows of a lower triangular matrix. We demonstrate that we are able to process 16 bytes in parallel with very few pipeline stalls for a suite of widely used XML benchmarks. Moreover, at a 28nm technology node, we can process XML data at 106 Gbps, roughly 6.5X faster than competing prior work.
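
The LTMAdder itself is a hardware unit, but the lower-triangular row-sum idea can be illustrated in software: applying a lower triangular 0/1 matrix to a flag vector yields prefix sums, allowing a per-byte decision for a whole 16-byte chunk in one step. A minimal sketch, with assumed flag semantics:

```python
# Software analogy of the lower-triangular row-sum idea. Purely
# illustrative; the paper's LTMAdder is a hardware unit.

CHUNK = 16

def prefix_sums(flags):
    """flags: CHUNK 0/1 values, e.g. 1 where '<' occurs in the chunk.

    Row i of the implicit lower triangular matrix selects
    flags[0..i], so its row sum counts tag openers seen up to and
    including byte i.
    """
    return [sum(flags[: i + 1]) for i in range(CHUNK)]

chunk = b"<a><b>text</b></a>"[:CHUNK]
openers = [1 if ch == ord("<") else 0 for ch in chunk]
print(prefix_sums(openers))
# -> [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4]
```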

13:00   End of session
Lunch Break in Großer Saal and Saal 1



Coffee Breaks in the Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served in the exhibition area (Terrace Level of the ICCD) during the coffee breaks at the times listed below.

Lunch Breaks (Großer Saal + Saal 1)

On all conference days (Tuesday to Thursday), a seated lunch (lunch buffet) will be offered in the rooms "Großer Saal" and "Saal 1" (Saal Level of the ICCD) to fully registered conference delegates only. There will be badge control at the entrance to the lunch break area.

Tuesday, March 20, 2018

  • Coffee Break 10:30 - 11:30
  • Lunch Break 13:00 - 14:30
  • Awards Presentation and Keynote Lecture in "Saal 2" 13:50 - 14:20
  • Coffee Break 16:00 - 17:00

Wednesday, March 21, 2018

  • Coffee Break 10:00 - 11:00
  • Lunch Break 12:30 - 14:30
  • Awards Presentation and Keynote Lecture in "Saal 2" 13:30 - 14:20
  • Coffee Break 16:00 - 17:00

Thursday, March 22, 2018

  • Coffee Break 10:00 - 11:00
  • Lunch Break 12:30 - 14:00
  • Keynote Lecture in "Saal 2" 13:20 - 13:50
  • Coffee Break 15:30 - 16:00