6.4 Hardware support for microarchitecture performance


Time

Label

Presentation Title
Authors

11:00

6.4.1

MAXIMUM-CONTENTION CONTROL UNIT (MCCU): RESOURCE ACCESS COUNT AND CONTENTION TIME ENFORCEMENT
Speaker:
Jordi Cardona, Univ. Politècnica de Barcelona and Barcelona Supercomputing Center, ES
Authors:
Jordi Cardona¹, Carles Hernandez², Jaume Abella² and Francisco Cazorla²
¹Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, ES; ²Barcelona Supercomputing Center, ES
Abstract
In real-time systems, techniques to derive bounds on the contention that tasks can suffer in multicores build on resource quota monitoring and enforcement. In particular, they track and bound the number of requests to shared hardware resources that each core (task) is allowed to perform. In this paper, we show that current software-only solutions work well when there is a single resource and a single type of request to track and bound, but do not scale to the more general case of several shared resources that accept different request types, each with a different associated latency. To handle this more general case, we propose low-overhead hardware support called the Maximum-Contention Control Unit (MCCU). The MCCU performs fine-grain tracking of different types of requests, preventing a core from causing more interference on its contenders than budgeted. In this process, the MCCU also helps verify that the duration of individual requests does not exceed their theoretical bounds, hence dealing with scenarios in which requests can have an arbitrarily large duration.
Download Paper (PDF; Only available from the DATE venue WiFi)
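The budgeting idea in the abstract above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the MCCU hardware design: the class name `ContentionBudget`, the request types, and the latency values are all hypothetical, and charging each request its worst-case latency is one plausible reading of "contention time enforcement".

```python
from dataclasses import dataclass, field


@dataclass
class ContentionBudget:
    """Hypothetical per-core model: charge each request its worst-case latency."""
    latency_bound: dict          # request type -> assumed worst-case cycles
    budget_cycles: int           # contention this core may still cause
    violations: list = field(default_factory=list)

    def request(self, req_type: str, observed_cycles: int) -> bool:
        """Return True if the request fits the budget, False if it must be blocked."""
        bound = self.latency_bound[req_type]
        # Record requests whose observed duration exceeds the theoretical bound.
        if observed_cycles > bound:
            self.violations.append((req_type, observed_cycles))
        # Block the request when charging it would exceed the remaining budget.
        if bound > self.budget_cycles:
            return False
        self.budget_cycles -= bound
        return True


# Illustrative numbers only (not taken from the paper).
core0 = ContentionBudget(latency_bound={"read": 6, "write": 10}, budget_cycles=20)
accepted = [core0.request("read", 5),
            core0.request("write", 12),   # longer than its bound: recorded
            core0.request("read", 6)]     # would exceed the budget: blocked
```

Counting requests alone would treat the read and the write as equal; charging by request type is what lets a single budget cover resources with different latencies.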

11:30

6.4.2

FIFORDER MICROARCHITECTURE: READY-AWARE INSTRUCTION SCHEDULING FOR OOO PROCESSORS
Speaker:
Mehdi Alipour, Uppsala University, SE
Authors:
Mehdi Alipour¹, Rakesh Kumar², Stefanos Kaxiras¹ and David Black-Schaffer¹
¹Uppsala University, SE; ²Norwegian University of Science and Technology, NO
Abstract
The number of instructions a processor's instruction queue can examine (depth) and the number it can issue together (width) determine its ability to take advantage of the ILP in an application. Unfortunately, increasing either the width or depth of the instruction queue is very costly due to the content-addressable logic needed to wake up and select instructions out of order. This work makes the observation that a large number of instructions have both operands ready at dispatch, and therefore do not benefit from out-of-order scheduling. We leverage this to place such ready-at-dispatch instructions in separate, simpler, in-order FIFO queues for scheduling. With such additional queues, we can reduce the size and width of the expensive out-of-order instruction queue, without reducing the processor's overall issue width and depth. Our design, FIFOrder, is able to steer more than 60% of instructions to the cheaper FIFO queues, providing 50% energy savings over a traditional out-of-order instruction queue design, while delivering 8% higher performance.
Download Paper (PDF; Only available from the DATE venue WiFi)
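The steering decision described in the abstract above can be sketched as a toy dispatch loop. This is a minimal illustration, not the FIFOrder microarchitecture: the function `steer`, the instruction tuples, and the fallback to the out-of-order queue when the FIFO is full are assumptions made for the example.

```python
from collections import deque


def steer(instructions, ready_regs, fifo_capacity):
    """Split instructions into an in-order FIFO and an out-of-order queue.

    instructions: iterable of (name, dest_reg, source_regs) in program order.
    ready_regs:   registers whose values are available at dispatch.
    """
    ready = set(ready_regs)
    fifo, ooo_queue = deque(), []
    for name, dest, sources in instructions:
        all_ready = all(src in ready for src in sources)
        if all_ready and len(fifo) < fifo_capacity:
            fifo.append(name)          # ready at dispatch: cheap in-order queue
        else:
            ooo_queue.append(name)     # must wait for operands: out-of-order queue
        # The destination is not available until the instruction executes.
        ready.discard(dest)
    return list(fifo), ooo_queue


program = [
    ("i1", "r3", ("r1", "r2")),   # both operands ready
    ("i2", "r4", ("r3",)),        # depends on i1
    ("i3", "r5", ("r1",)),        # operand ready
]
fifo, ooo_queue = steer(program, ready_regs={"r1", "r2"}, fifo_capacity=4)
```

Here `i1` and `i3` go to the FIFO while `i2`, which waits on `r3`, goes to the out-of-order queue.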

12:00

6.4.3

BOOSTING SIMD BENEFITS THROUGH A RUN-TIME AND ENERGY EFFICIENT DLP DETECTION
Speaker:
Mateus Rutzig, UFSM, BR
Authors:
Michael Jordan, Tiago Knorst, Julio Vicenzi and Mateus Beck Rutzig, UFSM, BR
Abstract
Data-level parallelism (DLP) has been improving the performance-energy tradeoff of current processors through coupled SIMD engines, such as Intel AVX and ARM NEON. Special libraries and compilers are used to support DLP execution on such engines. However, hand coding inevitably adds development time, since most software developers are not skilled at extracting DLP with unfamiliar libraries. In addition, compiler-based DLP detection, besides breaking software compatibility, is limited to static code analysis, which compromises performance gains. In this work, we propose a runtime DLP detection technique named Dynamic SIMD Assembler (DSA), which transparently identifies vectorizable code regions to execute in the ARM NEON engine. Because it works dynamically, DSA keeps software compatibility and avoids extra time in the software development process. Results show that DSA outperforms the ARM NEON auto-vectorizing compiler by 32%, since it covers a wider range of vectorizable regions, such as Dynamic Range, Sentinel and Conditional Loops. In addition, DSA outperforms hand-vectorized code using the ARM library by 26% while reducing energy consumption by 45%, with no penalty in software development time.
Download Paper (PDF; Only available from the DATE venue WiFi)

12:30

IP3-2, 336

DEPENDENCY-RESOLVING INTRA-UNIT PIPELINE ARCHITECTURE FOR HIGH-THROUGHPUT MULTIPLIERS
Speaker:
Dae Hyun Kim, Washington State University, US
Authors:
Jihee Seo and Dae Hyun Kim, Washington State University, US
Abstract
In this paper, we propose two dependency-resolving intra-unit pipeline architectures to design high-throughput multipliers. Simulation results show that the proposed multipliers achieve approximately 2.3× to 3.1× execution time reduction at a cost of 4.4% area and 3.7% power overheads for highly-dependent multiplications.
Download Paper (PDF; Only available from the DATE venue WiFi)

12:31

IP3-3, 832

A HARDWARE-EFFICIENT LOGARITHMIC MULTIPLIER WITH IMPROVED ACCURACY
Authors:
Mohammad Saeed Ansari, Bruce Cockburn and Jie Han, University of Alberta, CA
Abstract
Logarithmic multipliers take the base-2 logarithm of the operands and perform multiplication by only using shift and addition operations. Since computing the logarithm is often an approximate process, some accuracy loss is inevitable in such designs. However, the area, latency, and power consumption can be significantly improved at the cost of accuracy loss. This paper presents a novel method to approximate log₂ N that, unlike the existing approaches, rounds N to its nearest power of two instead of the highest power of two smaller than or equal to N. This approximation technique is then used to design two improved 16×16 logarithmic multipliers that use exact and approximate adders (ILM-EA and ILM-AA, respectively). These multipliers achieve up to 24.42% and 9.82% savings in area and power-delay product, respectively, compared to the state-of-the-art design in the literature with similar accuracy. The proposed designs are evaluated in the Joint Photographic Experts Group (JPEG) image compression algorithm and their advantages over other approximate logarithmic multipliers are shown.
Download Paper (PDF; Only available from the DATE venue WiFi)
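The difference between rounding to the nearest power of two and rounding down can be seen in a short software sketch. This is not the authors' circuit: it assumes each operand is written as N = 2^k + q, that the product is approximated by dropping the q1·q2 term so that only shifts and additions remain, and that ties round down. All function names are hypothetical.

```python
def nearest_pow2_split(n: int):
    """Return (k, q) with n == 2**k + q, where 2**k is the power of two nearest to n."""
    if n <= 0:
        raise ValueError("operand must be positive")
    k = n.bit_length() - 1                 # floor(log2(n))
    lower, upper = 1 << k, 1 << (k + 1)
    if n - lower > upper - n:              # closer to the next power of two
        k += 1
    return k, n - (1 << k)                 # q is negative when rounding up


def floor_pow2_split(n: int):
    """Return (k, q) using the highest power of two not exceeding n."""
    if n <= 0:
        raise ValueError("operand must be positive")
    k = n.bit_length() - 1
    return k, n - (1 << k)


def approx_mul(a: int, b: int, split=nearest_pow2_split) -> int:
    """Approximate a*b with shifts and additions only, dropping the q1*q2 term."""
    k1, q1 = split(a)
    k2, q2 = split(b)
    return (1 << (k1 + k2)) + (q1 << k2) + (q2 << k1)


exact = 7 * 7
with_nearest = approx_mul(7, 7)                       # 7 = 8 - 1
with_floor = approx_mul(7, 7, split=floor_pow2_split)  # 7 = 4 + 3
```

Under these assumptions the error is exactly q1·q2, so choosing the nearest power of two, which keeps |q| small, tends to reduce it.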

12:32

IP3-4, 440

LIGHTWEIGHT HARDWARE SUPPORT FOR SELECTIVE COHERENCE IN HETEROGENEOUS MANYCORE ACCELERATORS
Speaker:
Alessandro Cilardo, CeRICT, IT
Authors:
Alessandro Cilardo, Mirko Gagliardi and Vincenzo Scotti, University of Naples Federico II, IT
Abstract
Shared memory coherence is a key feature in manycore accelerators, ensuring programmability and application portability. Most established solutions for coherence in homogeneous systems cannot be simply reused because of the special requirements of accelerator architectures. This paper introduces a low-overhead hardware coherence system for heterogeneous accelerators, with customizable granularity and noncoherent region support. The coherence system has been demonstrated in operation in a full manycore accelerator, exhibiting significant improvements in terms of network load, execution time, and power consumption.
Download Paper (PDF; Only available from the DATE venue WiFi)

12:30

End of session
Lunch Break in Lunch Area

Coffee Breaks in the Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area.

Lunch Breaks (Lunch Area)

On all conference days (Tuesday to Thursday), a seated lunch (lunch buffet) will be offered in the Lunch Area to fully registered conference delegates only. There will be badge control at the entrance to the lunch break area.

Tuesday, March 26, 2019

Coffee Break 10:30 - 11:30
Lunch Break 13:00 - 14:30
Keynote Lecture "Leonardo da Vinci, Humanism and Engineering between Florence and Milan" by Claudio Giorgione in room 1 13:50 - 14:20
Coffee Break 16:00 - 17:00

Wednesday, March 27, 2019

Coffee Break 10:00 - 11:00
Lunch Break 12:30 - 14:30
Keynote Lecture "Heterogeneous, High Scale Computing in the Era of Intelligent, Cloud-Connected" by David Pellerin, Amazon, US in room 1 13:50 - 14:20
Coffee Break 16:00 - 17:00

Thursday, March 28, 2019

Coffee Break 10:00 - 11:00
University Booth Best Demo Award Presentation at the University Booth 10:30
Lunch Break 12:30 - 14:00
Keynote Lecture "A Fundamental Look at Models and Intelligence" by Edward A. Lee, University of California, Berkeley, US in room 1 13:20 - 13:50
Coffee Break 15:30 - 16:00