9.4 Design Space Exploration


Date: Thursday 30 March 2017
Time: 08:30 - 10:00
Location / Room: 3A

Chair:
Lars Bauer, KIT Karlsruhe, DE

Co-Chair:
Alberto Del Barrio, Universidad Complutense de Madrid, ES

This session features methods that extract desired implementation options from the huge design space of digital systems. The first talk presents a method to pick valuable operating points from a Pareto-optimal set of task mappings for efficient online resource management. The second talk introduces a rapid estimation framework that evaluates performance/area metrics of various accelerator options for an application at an early design phase. The third talk presents a design space exploration for implementing the convolutional layers of neural networks with the goal of maximizing performance. The fourth talk presents an HLS scheduling method optimized for incorporating radix-8 Booth multipliers. The session concludes with two short introductions of interactive presentations.

Time | Label | Presentation Title
Authors
08:30 | 9.4.1 | AUTOMATIC OPERATING POINT DISTILLATION FOR HYBRID MAPPING METHODOLOGIES
Speaker:
Behnaz Pourmohseni, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE
Authors:
Behnaz Pourmohseni1, Michael Glaß2 and Jürgen Teich1
1Friedrich-Alexander-Universität Erlangen-Nürnberg, DE; 2Ulm University, DE
Abstract
Efficient execution of applications on heterogeneous many-core platforms requires mapping solutions that address different aspects of run-time dynamism such as resource availability, energy budgets, and timing requirements. Hybrid mapping methodologies employ a static design space exploration (DSE) to obtain a set of mapping alternatives, termed operating points, that trade off quality properties (compute performance, energy consumption, etc.) and resource requirements (number of allocated resources of each type, etc.), among which one is selected at run time by a run-time resource manager (RRM). Given multiple quality properties and the presence of heterogeneous resources, the DSE typically delivers a substantially large set of operating points, the handling of which may impose an intolerable run-time overhead on the RRM. This paper investigates the problem of truncating the set of operating points, termed operating point distillation, such that (a) an acceptable run-time overhead is achieved, (b) online quality requirements are met, and (c) dynamic resource constraints are satisfied, i.e., application embeddability is preserved. We propose an automatic design-time distillation methodology that employs a hyper-grid-based approach to retain diverse trade-off options w.r.t. quality properties, while selecting representative operating points based on their resource requirements to achieve a high level of run-time embeddability. Experimental results for a variety of applications show that, compared to existing truncation approaches, the proposed methodology significantly enhances run-time embeddability while achieving competitive and often improved efficiency in the distilled quality properties.
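
The hyper-grid idea can be pictured with a small sketch. The following Python fragment is an illustrative reading of the abstract, not the authors' implementation: it buckets operating points by their quality properties into a coarse grid and keeps, per occupied cell, the point with the smallest resource footprint; the cell count, point representation, and footprint metric are all assumptions.

    # Illustrative sketch only: hyper-grid distillation of operating points.
    from collections import defaultdict

    def distill(points, cells_per_dim=4):
        """points: list of dicts with 'quality' (tuple of floats, e.g. latency,
        energy) and 'resources' (dict mapping resource type to count)."""
        dims = len(points[0]["quality"])
        lo = [min(p["quality"][d] for p in points) for d in range(dims)]
        hi = [max(p["quality"][d] for p in points) for d in range(dims)]

        def cell(p):
            # Map each quality value to a grid index in [0, cells_per_dim - 1].
            idx = []
            for d in range(dims):
                span = (hi[d] - lo[d]) or 1.0
                idx.append(min(int((p["quality"][d] - lo[d]) / span * cells_per_dim),
                               cells_per_dim - 1))
            return tuple(idx)

        buckets = defaultdict(list)
        for p in points:
            buckets[cell(p)].append(p)

        # Keep one representative per cell: the point needing the fewest resources,
        # on the assumption that it is the easiest to embed at run time.
        return [min(group, key=lambda p: sum(p["resources"].values()))
                for group in buckets.values()]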

09:00 | 9.4.2 | DESIGN SPACE EXPLORATION OF FPGA-BASED ACCELERATORS WITH MULTI-LEVEL PARALLELISM
Speaker:
Guanwen Zhong, National University of Singapore, SG
Authors:
Guanwen Zhong1, Alok Prakash2, Siqi Wang1, Yun (Eric) Liang3, Tulika Mitra1 and Smail Niar4
1National University of Singapore, SG; 2Nanyang Technological University, SG; 3Peking University, CN; 4LAMIH-University of Valenciennes, FR
Abstract
Applications containing compute-intensive kernels with nested loops can effectively leverage FPGAs to exploit fine- and coarse-grained parallelism. HLS tools used to translate these kernels from high-level languages (e.g., C/C++), however, are inefficient in exploiting multiple levels of parallelism automatically, thereby producing sub-optimal accelerators. Moreover, the large design space resulting from the various combinations of fine- and coarse-grained parallelism options makes exhaustive design space exploration prohibitively time-consuming with HLS tools. Hence, we propose a rapid estimation framework, MPSeeker, to evaluate performance/area metrics of various accelerator options for an application at an early design phase. Experimental results show that MPSeeker can rapidly (in minutes) explore the complex design space and accurately estimate performance/area of various design points to identify the near-optimal (95.7% performance of the optimal on average) combination of parallelism options.
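
A minimal sketch of the kind of exploration loop the abstract describes (not MPSeeker itself): enumerate combinations of coarse-grained (compute-unit count) and fine-grained (loop-unroll factor) parallelism, discard those exceeding a DSP budget, and rank the rest with a toy analytical latency model. The cost formulas and parameter names below are placeholders, not the paper's estimation model.

    # Illustrative sketch only: exhaustive sweep over parallelism options.
    import math

    def explore(trip_count, work_per_iter, dsp_budget,
                unroll_factors=(1, 2, 4, 8), cu_counts=(1, 2, 4)):
        best = None
        for unroll in unroll_factors:
            for cus in cu_counts:
                dsps = unroll * cus * 4               # assumed DSPs per parallel lane
                if dsps > dsp_budget:
                    continue                          # infeasible: exceeds area budget
                iters = math.ceil(trip_count / (unroll * cus))
                latency = iters * work_per_iter       # cycles, ignoring memory stalls
                cand = {"unroll": unroll, "cus": cus, "dsps": dsps, "latency": latency}
                if best is None or cand["latency"] < best["latency"]:
                    best = cand
        return best

    print(explore(trip_count=1024, work_per_iter=10, dsp_budget=64))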

09:30 | 9.4.3 | DESIGN SPACE EXPLORATION OF FPGA ACCELERATORS FOR CONVOLUTIONAL NEURAL NETWORKS
Speaker:
Jongeun Lee, UNIST, KR
Authors:
Atul Rahman1, Sangyun Oh2, Jongeun Lee3 and Kiyoung Choi4
1Samsung Electronics, KR; 2UNIST, KR; 3Ulsan National Institute of Science and Technology (UNIST), KR; 4Seoul National University, KR
Abstract
The increasing use of machine learning algorithms, such as Convolutional Neural Networks (CNNs), makes the hardware accelerator approach very compelling. However, the question of how to best design an accelerator for a given CNN has not been answered yet, even at a very fundamental level. This paper addresses that challenge by providing a novel framework that can universally and accurately evaluate and explore various architectural choices for CNN accelerators on FPGAs. Our exploration framework is more extensive than that of any previous work in terms of the design space, and takes into account various FPGA resources to maximize performance, including DSP resources, on-chip memory, and off-chip memory bandwidth. Our experimental results using some of the largest CNN models, including one with 16 convolutional layers, demonstrate the efficacy of our framework, as well as the need for such a high-level architecture exploration approach to find the best architecture for a CNN model.
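
As a rough illustration of the resource-aware exploration the abstract describes (not the paper's actual framework), the sketch below screens candidate convolution-layer designs against the three FPGA resources named in the abstract using a roofline-style bound; the candidate fields, per-PE DSP cost, and constants are assumptions.

    # Illustrative sketch only: roofline-style screening of CNN accelerator candidates.
    def attainable_gops(cand, layer, fpga):
        # Total operations of one convolutional layer (2 ops per multiply-accumulate).
        ops = 2 * layer["K"] ** 2 * layer["Cin"] * layer["Cout"] * layer["H"] * layer["W"]
        if cand["pe_count"] * fpga["dsp_per_pe"] > fpga["dsp"]:
            return 0.0                                   # exceeds the DSP budget
        if cand["buffer_bytes"] > fpga["bram_bytes"]:
            return 0.0                                   # exceeds on-chip memory
        compute_roof = 2 * cand["pe_count"] * fpga["freq_ghz"]   # GOP/s from the PE array
        ctc = ops / max(cand["offchip_bytes"], 1)        # operations per off-chip byte
        bandwidth_roof = fpga["bw_gbs"] * ctc            # GOP/s allowed by DRAM bandwidth
        return min(compute_roof, bandwidth_roof)

    # Usage: rank candidate designs by the bound and keep the best feasible one.
    # best = max(candidates, key=lambda c: attainable_gops(c, layer, fpga))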

09:45 | 9.4.4 | A SLACK-BASED APPROACH TO EFFICIENTLY DEPLOY RADIX 8 BOOTH MULTIPLIERS
Speaker:
Alberto Antonio Del Barrio, Universidad Complutense de Madrid, ES
Authors:
Alberto Antonio Del Barrio Garcia and Roman Hermida, Complutense University of Madrid, ES
Abstract
In 1951, A. Booth published his algorithm to efficiently multiply signed numbers. Since the appearance of this algorithm, it has been widely accepted that radix-4 Booth multipliers are the most efficient. They allow the height of the multiplier to be halved, at the expense of a simple recoding that consists of just shifts and negations. In theory, higher radices should produce even larger reductions, especially in terms of area and power, but the recoding process is much more complex. Notably, in the case of radix 8 it is necessary to compute 3X, X being the multiplicand. In order to avoid the penalty due to this calculation, we propose decoupling it from the product and considering 3X as an extra operation within the application's Dataflow Graph (DFG). Experiments show that there is typically enough slack in the DFGs to do this without degrading the performance of the circuit, which permits the efficient deployment of radix-8 multipliers that do not calculate the 3X multiple. Results show that our approach is 10% and 17% faster than radix-4 and radix-8 Booth-based implementations, respectively, and 12% and 10% more energy efficient in terms of Energy-Delay Product.
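
A small sketch of the underlying idea, assuming nothing about the authors' HLS scheduler: radix-8 Booth recoding selects digits from {0, ±X, ±2X, ±3X, ±4X}, and the only hard multiple, 3X, can be supplied as a separate, precomputed operand produced by an ordinary adder elsewhere in the DFG.

    # Illustrative sketch only: radix-8 Booth multiply with 3X passed in as a
    # precomputed operand, mirroring the idea of hoisting the 3X computation
    # into the dataflow graph where scheduling slack can absorb it.
    def booth_radix8(x, y, bits=16, x3=None):
        """x, y: `bits`-bit two's-complement operands (given as Python ints)."""
        if x3 is None:
            x3 = 3 * x                      # normally an extra adder inside the multiplier
        multiples = {0: 0, 1: x, 2: 2 * x, 3: x3, 4: 4 * x}

        ngroups = -(-bits // 3)             # ceil(bits / 3) radix-8 digits
        mask = (1 << (3 * ngroups + 1)) - 1
        y_ext = (y << 1) & mask             # append y[-1] = 0 and sign-extend

        product = 0
        for j in range(ngroups):
            group = (y_ext >> (3 * j)) & 0xF                               # 4-bit window
            digit = (group >> 1) + (group & 1) - (8 if group >= 8 else 0)  # in [-4, 4]
            term = multiples[abs(digit)]
            product += (term if digit >= 0 else -term) << (3 * j)
        return product

    # The 3X operand can be produced earlier by a plain add in the DFG:
    # assert booth_radix8(7, -19, bits=16, x3=7 + (7 << 1)) == 7 * -19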

10:00 | IP4-10, 128 | A SCHEDULABILITY TEST FOR SOFTWARE MIGRATION ON MULTICORE SYSTEMS
Speaker:
Jung-Eun Kim, Department of Computer Science at the University of Illinois at Urbana-Champaign, US
Authors:
Jung-Eun Kim1, Richard Bradford2, Tarek Abdelzaher3 and Lui Sha3
1Department of Computer Science, University of Illinois at Urbana-Champaign, US; 2Rockwell Collins, Cedar Rapids, IA, US; 3University of Illinois, US
Abstract
This paper presents a new schedulability test for safety-critical software undergoing a transition from single-core to multicore systems, a challenge faced by multiple industries today. Our migration model consists of a schedulability test and an execution model. Its properties enable us to obtain a utilization bound that places an allowable limit on total task execution times. Evaluation results demonstrate the advantages of our scheduling model over competing resource partitioning approaches, such as Periodic Server and TDMA.
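
For orientation only, the sketch below shows the general shape of a utilization-bound schedulability test; it uses the classic Liu-and-Layland bound as a stand-in, whereas the paper derives its own bound tailored to the migration model.

    # Illustrative sketch only: a generic utilization-bound test, not the paper's.
    def schedulable(tasks):
        """tasks: list of (wcet, period) pairs assigned to one core."""
        n = len(tasks)
        utilization = sum(c / t for c, t in tasks)
        bound = n * (2 ** (1 / n) - 1)      # Liu & Layland bound for rate-monotonic scheduling
        return utilization <= bound

    # Example: three periodic tasks on a single core.
    print(schedulable([(1, 4), (1, 5), (2, 10)]))   # U = 0.65 <= 0.7798 -> True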

10:00 End of session
Coffee Break in Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served in the exhibition area during the coffee breaks at the times listed below.

Tuesday, March 28, 2017

  • Coffee Break 10:30 - 11:30
  • Coffee Break 16:00 - 17:00

Wednesday, March 29, 2017

  • Coffee Break 10:00 - 11:00
  • Coffee Break 16:00 - 17:00

Thursday, March 30, 2017

  • Coffee Break 10:00 - 11:00
  • Coffee Break 15:30 - 16:00