9.4 Design Space Exploration


Date: Thursday 30 March 2017
Time: 08:30 - 10:00
Location / Room: 3A

Chair:
Lars Bauer, KIT Karlsruhe, DE

Co-Chair:
Alberto Del Barrio, Universidad Complutense de Madrid, ES

This session features methods that extract desired implementation options from the huge design space of digital systems. The first talk presents a method to pick valuable operating points from a Pareto-optimal set of task mappings for efficient online resource management. The second talk introduces a rapid estimation framework that evaluates performance/area metrics of various accelerator options for an application at an early design phase. The third talk presents a design space exploration for implementing the convolutional layers of neural networks with the goal of maximizing performance. The fourth talk presents an HLS scheduling method optimized for incorporating radix-8 Booth multipliers. The session concludes with two short introductions of interactive presentations.

Time | Label | Presentation Title
Authors
08:30 | 9.4.1 | AUTOMATIC OPERATING POINT DISTILLATION FOR HYBRID MAPPING METHODOLOGIES
Speaker:
Behnaz Pourmohseni, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE
Authors:
Behnaz Pourmohseni1, Michael Glaß2 and Jürgen Teich1
1Friedrich-Alexander-Universität Erlangen-Nürnberg, DE; 2Ulm University, DE
Abstract
Efficient execution of applications on heterogeneous many-core platforms requires mapping solutions that address different aspects of run-time dynamism such as resource availability, energy budgets, and timing requirements. Hybrid mapping methodologies employ a static design space exploration (DSE) to obtain a set of mapping alternatives, termed operating points, that trade off quality properties (compute performance, energy consumption, etc.) and resource requirements (number of allocated resources of each type, etc.), among which one is selected at run time by a run-time resource manager (RRM). Given multiple quality properties and the presence of heterogeneous resources, the DSE typically delivers a substantially large set of operating points, the handling of which may impose an intolerable run-time overhead on the RRM. This paper investigates the problem of truncating the set of operating points, termed operating point distillation, such that (a) an acceptable run-time overhead is achieved, (b) online quality requirements are met, and (c) dynamic resource constraints are satisfied, i.e., application embeddability is preserved. We propose an automatic design-time distillation methodology that employs a hyper-grid-based approach to retain diverse trade-off options w.r.t. quality properties, while selecting representative operating points based on their resource requirements to achieve a high level of run-time embeddability. Experimental results for a variety of applications show that, compared to existing truncation approaches, the proposed methodology significantly enhances run-time embeddability while achieving competitive and often improved efficiency in the distilled quality properties.
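
The hyper-grid idea can be pictured with a small sketch. The following Python fragment is an illustrative reading of the abstract, not the authors' implementation: it buckets operating points by their quality properties into a coarse grid and keeps, per occupied cell, the point with the smallest resource footprint; the cell count, point representation, and footprint metric are all assumptions.

    # Illustrative sketch only: hyper-grid distillation of operating points.
    from collections import defaultdict

    def distill(points, cells_per_dim=4):
        """points: list of dicts with 'quality' (tuple of floats, e.g. latency,
        energy) and 'resources' (dict mapping resource type to count)."""
        dims = len(points[0]["quality"])
        lo = [min(p["quality"][d] for p in points) for d in range(dims)]
        hi = [max(p["quality"][d] for p in points) for d in range(dims)]

        def cell(p):
            # Map each quality value to a grid index in [0, cells_per_dim - 1].
            idx = []
            for d in range(dims):
                span = (hi[d] - lo[d]) or 1.0
                idx.append(min(int((p["quality"][d] - lo[d]) / span * cells_per_dim),
                               cells_per_dim - 1))
            return tuple(idx)

        buckets = defaultdict(list)
        for p in points:
            buckets[cell(p)].append(p)

        # Keep one representative per cell: the point needing the fewest resources,
        # on the assumption that it is the easiest to embed at run time.
        return [min(group, key=lambda p: sum(p["resources"].values()))
                for group in buckets.values()]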

09:00 | 9.4.2 | DESIGN SPACE EXPLORATION OF FPGA-BASED ACCELERATORS WITH MULTI-LEVEL PARALLELISM
Speaker:
Guanwen Zhong, National University of Singapore, SG
Authors:
Guanwen Zhong1, Alok Prakash2, Siqi Wang1, Yun (Eric) Liang3, Tulika Mitra1 and Smail Niar4
1National University of Singapore, SG; 2Nanyang Technological University, SG; 3Peking University, CN; 4LAMIH-University of Valenciennes, FR
Abstract
Applications containing compute-intensive kernels with nested loops can effectively leverage FPGAs to exploit fine- and coarse-grained parallelism. HLS tools used to translate these kernels from high-level languages (e.g., C/C++), however, are inefficient in exploiting multiple levels of parallelism automatically, thereby producing sub-optimal accelerators. Moreover, the large design space resulting from the various combinations of fine- and coarse-grained parallelism options makes exhaustive design space exploration prohibitively time-consuming with HLS tools. Hence, we propose a rapid estimation framework, MPSeeker, to evaluate performance/area metrics of various accelerator options for an application at an early design phase. Experimental results show that MPSeeker can rapidly (in minutes) explore the complex design space and accurately estimate performance/area of various design points to identify the near-optimal (95.7% performance of the optimal on average) combination of parallelism options.
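
A minimal sketch of the kind of exploration loop the abstract describes (not MPSeeker itself): enumerate combinations of coarse-grained (compute-unit count) and fine-grained (loop-unroll factor) parallelism, discard those exceeding a DSP budget, and rank the rest with a toy analytical latency model. The cost formulas and parameter names below are placeholders, not the paper's estimation model.

    # Illustrative sketch only: exhaustive sweep over parallelism options.
    import math

    def explore(trip_count, work_per_iter, dsp_budget,
                unroll_factors=(1, 2, 4, 8), cu_counts=(1, 2, 4)):
        best = None
        for unroll in unroll_factors:
            for cus in cu_counts:
                dsps = unroll * cus * 4               # assumed DSPs per parallel lane
                if dsps > dsp_budget:
                    continue                          # infeasible: exceeds area budget
                iters = math.ceil(trip_count / (unroll * cus))
                latency = iters * work_per_iter       # cycles, ignoring memory stalls
                cand = {"unroll": unroll, "cus": cus, "dsps": dsps, "latency": latency}
                if best is None or cand["latency"] < best["latency"]:
                    best = cand
        return best

    print(explore(trip_count=1024, work_per_iter=10, dsp_budget=64))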

09:30 | 9.4.3 | DESIGN SPACE EXPLORATION OF FPGA ACCELERATORS FOR CONVOLUTIONAL NEURAL NETWORKS
Speaker:
Jongeun Lee, UNIST, KR
Authors:
Atul Rahman1, Sangyun Oh2, Jongeun Lee3 and Kiyoung Choi4
1Samsung Electronics, KR; 2UNIST, KR; 3Ulsan National Institute of Science and Technology (UNIST), KR; 4Seoul National University, KR
Abstract
The increasing use of machine learning algorithms, such as Convolutional Neural Networks (CNNs), makes the hardware accelerator approach very compelling. However, the question of how to best design an accelerator for a given CNN has not been answered yet, even at a very fundamental level. This paper addresses that challenge by providing a novel framework that can universally and accurately evaluate and explore various architectural choices for CNN accelerators on FPGAs. Our exploration framework is more extensive than that of any previous work in terms of the design space, and takes into account various FPGA resources to maximize performance, including DSP resources, on-chip memory, and off-chip memory bandwidth. Our experimental results using some of the largest CNN models, including one with 16 convolutional layers, demonstrate the efficacy of our framework, as well as the need for such a high-level architecture exploration approach to find the best architecture for a CNN model.
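
As a rough illustration of the resource-aware exploration the abstract describes (not the paper's actual framework), the sketch below screens candidate convolution-layer designs against the three FPGA resources named in the abstract using a roofline-style bound; the candidate fields, per-PE DSP cost, and constants are assumptions.

    # Illustrative sketch only: roofline-style screening of CNN accelerator candidates.
    def attainable_gops(cand, layer, fpga):
        # Total operations of one convolutional layer (2 ops per multiply-accumulate).
        ops = 2 * layer["K"] ** 2 * layer["Cin"] * layer["Cout"] * layer["H"] * layer["W"]
        if cand["pe_count"] * fpga["dsp_per_pe"] > fpga["dsp"]:
            return 0.0                                   # exceeds the DSP budget
        if cand["buffer_bytes"] > fpga["bram_bytes"]:
            return 0.0                                   # exceeds on-chip memory
        compute_roof = 2 * cand["pe_count"] * fpga["freq_ghz"]   # GOP/s from the PE array
        ctc = ops / max(cand["offchip_bytes"], 1)        # operations per off-chip byte
        bandwidth_roof = fpga["bw_gbs"] * ctc            # GOP/s allowed by DRAM bandwidth
        return min(compute_roof, bandwidth_roof)

    # Usage: rank candidate designs by the bound and keep the best feasible one.
    # best = max(candidates, key=lambda c: attainable_gops(c, layer, fpga))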

09:45 | 9.4.4 | A SLACK-BASED APPROACH TO EFFICIENTLY DEPLOY RADIX 8 BOOTH MULTIPLIERS
Speaker:
Alberto Antonio Del Barrio, Universidad Complutense de Madrid, ES
Authors:
Alberto Antonio Del Barrio Garcia and Roman Hermida, Complutense University of Madrid, ES
Abstract
In 1951, A. Booth published his algorithm to efficiently multiply signed numbers. Since the appearance of this algorithm, it has been widely accepted that radix-4 Booth multipliers are the most efficient. They allow the height of the multiplier to be halved, at the expense of a simple recoding that consists of just shifts and negations. In theory, higher radices should produce even larger reductions, especially in terms of area and power, but the recoding process is much more complex. Notably, in the case of radix 8 it is necessary to compute 3X, X being the multiplicand. In order to avoid the penalty due to this calculation, we propose decoupling it from the product and considering 3X as an extra operation within the application's Dataflow Graph (DFG). Experiments show that there is typically enough slack in the DFGs to do this without degrading the performance of the circuit, which permits the efficient deployment of radix-8 multipliers that do not calculate the 3X multiple. Results show that our approach is 10% and 17% faster than radix-4 and radix-8 Booth-based implementations, respectively, and 12% and 10% more energy efficient in terms of Energy-Delay Product.
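
A small sketch of the underlying idea, assuming nothing about the authors' HLS scheduler: radix-8 Booth recoding selects digits from {0, ±X, ±2X, ±3X, ±4X}, and the only hard multiple, 3X, can be supplied as a separate, precomputed operand produced by an ordinary adder elsewhere in the DFG.

    # Illustrative sketch only: radix-8 Booth multiply with 3X passed in as a
    # precomputed operand, mirroring the idea of hoisting the 3X computation
    # into the dataflow graph where scheduling slack can absorb it.
    def booth_radix8(x, y, bits=16, x3=None):
        """x, y: `bits`-bit two's-complement operands (given as Python ints)."""
        if x3 is None:
            x3 = 3 * x                      # normally an extra adder inside the multiplier
        multiples = {0: 0, 1: x, 2: 2 * x, 3: x3, 4: 4 * x}

        ngroups = -(-bits // 3)             # ceil(bits / 3) radix-8 digits
        mask = (1 << (3 * ngroups + 1)) - 1
        y_ext = (y << 1) & mask             # append y[-1] = 0 and sign-extend

        product = 0
        for j in range(ngroups):
            group = (y_ext >> (3 * j)) & 0xF                               # 4-bit window
            digit = (group >> 1) + (group & 1) - (8 if group >= 8 else 0)  # in [-4, 4]
            term = multiples[abs(digit)]
            product += (term if digit >= 0 else -term) << (3 * j)
        return product

    # The 3X operand can be produced earlier by a plain add in the DFG:
    # assert booth_radix8(7, -19, bits=16, x3=7 + (7 << 1)) == 7 * -19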

10:00 | IP4-10, 128 | A SCHEDULABILITY TEST FOR SOFTWARE MIGRATION ON MULTICORE SYSTEMS
Speaker:
Jung-Eun Kim, Department of Computer Science at the University of Illinois at Urbana-Champaign, US
Authors:
Jung-Eun Kim1, Richard Bradford2, Tarek Abdelzaher3 and Lui Sha3
1Department of Computer Science, University of Illinois at Urbana-Champaign, US; 2Rockwell Collins, Cedar Rapids, IA, US; 3University of Illinois, US
Abstract
This paper presents a new schedulability test for safety-critical software undergoing a transition from single-core to multicore systems, a challenge faced by multiple industries today. Our migration model consists of a schedulability test and an execution model. Its properties enable us to obtain a utilization bound that places an allowable limit on total task execution times. Evaluation results demonstrate the advantages of our scheduling model over competing resource partitioning approaches, such as Periodic Server and TDMA.
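
For orientation only, the sketch below shows the general shape of a utilization-bound schedulability test; it uses the classic Liu-and-Layland bound as a stand-in, whereas the paper derives its own bound tailored to the migration model.

    # Illustrative sketch only: a generic utilization-bound test, not the paper's.
    def schedulable(tasks):
        """tasks: list of (wcet, period) pairs assigned to one core."""
        n = len(tasks)
        utilization = sum(c / t for c, t in tasks)
        bound = n * (2 ** (1 / n) - 1)      # Liu & Layland bound for rate-monotonic scheduling
        return utilization <= bound

    # Example: three periodic tasks on a single core.
    print(schedulable([(1, 4), (1, 5), (2, 10)]))   # U = 0.65 <= 0.7798 -> True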

10:00 End of session
Coffee Break in Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served in the exhibition area during the coffee breaks at the times listed below.

Tuesday, March 28, 2017

  • Coffee Break 10:30 - 11:30
  • Coffee Break 16:00 - 17:00

Wednesday, March 29, 2017

  • Coffee Break 10:00 - 11:00
  • Coffee Break 16:00 - 17:00

Thursday, March 30, 2017

  • Coffee Break 10:00 - 11:00
  • Coffee Break 15:30 - 16:00