4.2 Reconfigurable Architecture and Tools

Time

Label

Presentation Title
Authors

17:00

4.2.1

CONTEXT-MEMORY AWARE MAPPING FOR ENERGY EFFICIENT ACCELERATION WITH CGRAS
Speaker:
Satyajit Das, Univ. Bretagne-Sud, CNRS UMR 6285, Lab-STICC, FR
Authors:
Satyajit Das, Kevin Martin and Philippe Coussy, Université de Bretagne-Sud, FR
Abstract
Coarse-Grained Reconfigurable Arrays (CGRAs) are emerging as a low-power computing alternative that provides a high degree of acceleration. However, the area and energy efficiency of these devices are bottlenecked by the configuration/context memory when they are made autonomous and loosely coupled with CPUs. The size of these instruction memories is of prime importance because of their large area and their impact on power consumption. For instance, a 64-word instruction memory typically represents 40% of a processing element's area. Since traditional mapping approaches do not take the size of the context memory into account, CGRAs often become oversized, which strongly degrades their performance and appeal. In this paper, we propose a context-memory-aware mapping for CGRAs to achieve better area and energy efficiency. The paper motivates the need to constrain the size of the context memory inside the processing element (PE) for ultra-low-power acceleration. It also describes the mapping approach, which tries to find at least one mapping solution for a given set of constraints defined by the context memories of the PEs. Experiments show that the proposed solution achieves an average energy gain of 2.3× (with a maximum of 3.1× and a minimum of 1.4×) compared to the mapping approach without the memory constraints, while using half as much instruction memory. Compared to the CPU, the proposed mapping achieves an average energy gain of 14× (with a maximum of 23× and a minimum of 5×).
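The context-memory constraint described in this abstract can be illustrated with a minimal feasibility check. This is only a sketch under stated assumptions, not the authors' mapping algorithm: the names `fits_context_memory`, `mapping` and `capacity` are hypothetical, each mapped operation is assumed to occupy exactly one context word, and data dependencies, routing and scheduling are ignored.

```python
from collections import Counter


def fits_context_memory(mapping, capacity):
    """Return True if no PE needs more context words than it has.

    mapping  -- dict: operation name -> PE name (hypothetical format)
    capacity -- dict: PE name -> context memory size in words
    Assumption: one context word per mapped operation.
    """
    usage = Counter(mapping.values())  # context words needed per PE
    return all(count <= capacity.get(pe, 0) for pe, count in usage.items())


# Hypothetical three-operation kernel placed on two PEs.
mapping = {"add0": "pe0", "mul0": "pe0", "ld0": "pe1"}
print(fits_context_memory(mapping, {"pe0": 2, "pe1": 1}))  # both PEs fit
print(fits_context_memory(mapping, {"pe0": 1, "pe1": 1}))  # pe0 overflows
```

A mapper that respects the constraint would reject or repair any candidate placement for which this check fails, rather than assuming an oversized context memory.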

17:30

4.2.2

THERMAL-AWARE DESIGN AND FLOW FOR FPGA PERFORMANCE IMPROVEMENT
Speaker:
Tajana Rosing, University of California, San Diego, US
Authors:
Behnam Khaleghi and Tajana Rosing, University of California, San Diego, US
Abstract
To ensure reliable operation of circuits under elevated temperatures, designers are obliged to apply a pessimistic timing margin proportional to the worst-case temperature (T_worst), which incurs significant performance overhead. The problem is exacerbated in deep-CMOS technologies with increased leakage power, particularly in Field-Programmable Gate Arrays (FPGAs), which comprise an abundance of leaky resources. We propose a two-fold approach to tackle the problem in FPGAs. First, we obtain the performance and power characteristics of FPGA resources over a temperature range. Combining the temperature-performance correlation of resources with the estimated thermal distribution of applications makes it feasible to apply a minimal, yet sufficient, timing margin. Second, we show how optimizing an FPGA device for a specific thermal corner affects its performance across the operating temperature range. This emphasizes the need to optimize the device for the target temperature (or temperature range). Building upon this observation, we propose thermal-aware optimization of the FPGA architecture for foreknown field conditions. We performed a comprehensive set of experiments to implement and examine the proposed techniques. The experimental results reveal that thermal-aware timing on FPGAs yields up to 36.5% performance improvement. Optimizing the architecture further boosts the performance by 6.7%.
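The idea of replacing a worst-case margin with a margin at the estimated temperature can be sketched as follows. Everything here is an assumption made for illustration: the function names, the (temperature, delay) table format, the linear interpolation between characterised points and the numbers in the example are hypothetical and are not taken from the paper.

```python
from bisect import bisect_left


def delay_at(temp_c, table):
    """Interpolate path delay (ns) from a sorted list of (temp_c, delay_ns).

    Temperatures outside the characterised range are clamped to the
    nearest table entry instead of being extrapolated.
    """
    temps = [t for t, _ in table]
    if temp_c <= temps[0]:
        return table[0][1]
    if temp_c >= temps[-1]:
        return table[-1][1]
    i = bisect_left(temps, temp_c)
    (t0, d0), (t1, d1) = table[i - 1], table[i]
    return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)


def speedup_from_thermal_margin(table, t_estimated, t_worst):
    """Clock-frequency ratio gained by timing at t_estimated, not t_worst."""
    return delay_at(t_worst, table) / delay_at(t_estimated, table)


# Hypothetical characterisation: 1.00 ns at 25 C, 1.25 ns at 100 C.
table = [(25.0, 1.0), (100.0, 1.25)]
print(speedup_from_thermal_margin(table, t_estimated=25.0, t_worst=100.0))
```

In this toy setting, a design known to run at 25 C could be clocked 1.25 times faster than one timed at the 100 C corner; real gains depend on the measured characteristics of each resource.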

18:00

4.2.3

FIXER: FLOW INTEGRITY EXTENSIONS FOR EMBEDDED RISC-V
Speaker:
Swaroop Ghosh, The Pennsylvania State University, US
Authors:
Asmit De, Aditya Basu, Swaroop Ghosh and Trent Jaeger, Pennsylvania State University, US
Abstract
With the recent proliferation of Internet of Things (IoT) and embedded devices, there is a growing need for a security framework to protect such devices. RISC-V is a promising open-source architecture that targets low-power embedded devices and SoCs. However, there is a dearth of practical, low-overhead security solutions for the RISC-V architecture. Programs compiled using RISC-V toolchains are still vulnerable to code injection and code reuse attacks such as buffer overflow and return-oriented programming (ROP). In this paper, we propose FIXER, a hardware-implemented security extension to RISC-V that provides a defense mechanism against such attacks. FIXER enforces fine-grained control-flow integrity (CFI) of running programs on backward edges (returns) and forward edges (calls) without requiring any architectural modifications to the RISC-V processor core. We implement FIXER on RocketChip, a RISC-V SoC platform, by leveraging the integrated Rocket Custom Coprocessor (RoCC) to detect and prevent attacks. Compared to existing software-based solutions, FIXER reduces energy overhead by 60% with minimal execution-time (1.5%) and area (2.9%) overheads.
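The abstract does not say how the checks on calls and returns are implemented. As a software sketch of what backward-edge and forward-edge CFI checks typically look like, the example below uses two common techniques chosen here for illustration: a shadow stack for returns and a set of permitted call targets. The class and method names are hypothetical, and this is not presented as FIXER's actual hardware design.

```python
class ControlFlowViolation(Exception):
    """Raised when a call or return breaks the expected control flow."""


class ShadowStackMonitor:
    def __init__(self, valid_call_targets):
        self._shadow = []  # protected copies of return addresses
        self._valid_targets = frozenset(valid_call_targets)

    def on_call(self, target, return_address):
        # Forward edge: the callee must be a known function entry point.
        if target not in self._valid_targets:
            raise ControlFlowViolation(f"illegal call target {target:#x}")
        self._shadow.append(return_address)

    def on_return(self, return_address):
        # Backward edge: the address must match the one saved at call time.
        if not self._shadow or self._shadow.pop() != return_address:
            raise ControlFlowViolation(
                f"corrupted return address {return_address:#x}"
            )


monitor = ShadowStackMonitor({0x1000})
monitor.on_call(0x1000, return_address=0x400)
monitor.on_return(0x400)  # matches the saved address, so no exception
```

A return to an address overwritten by a buffer overflow would not match the saved copy, so `on_return` would raise instead of letting execution continue.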

18:30

IP2-1, 803

TRANSREC: IMPROVING ADAPTABILITY IN SINGLE-ISA HETEROGENEOUS SYSTEMS WITH TRANSPARENT AND RECONFIGURABLE ACCELERATION
Speaker:
Marcelo Brandalero, Universidade Federal do Rio Grande do Sul (UFRGS), BR
Authors:
Marcelo Brandalero¹, Muhammad Shafique², Luigi Carro¹ and Antonio Carlos Schneider Beck¹
¹UFRGS - Universidade Federal do Rio Grande do Sul, BR; ²Vienna University of Technology (TU Wien), AT
Abstract
Single-ISA heterogeneous systems, such as ARM's big.LITTLE, use microarchitecturally different General-Purpose Processor cores to efficiently match the capabilities of the processing resources to applications' performance and energy requirements, which change at run time. However, since only a fixed and non-configurable set of cores is available, reaching the best possible match between the available resources and applications' requirements remains a challenge, especially given varying and unpredictable workloads. In this work, we propose TransRec, a hardware architecture that improves on these traditional heterogeneous designs. TransRec integrates a shared, transparent (i.e., no need to change the application binary) and adaptive accelerator in the form of a Coarse-Grained Reconfigurable Array that can be used by any of the General-Purpose Processor cores for on-demand acceleration. Through evaluations with cycle-accurate gem5 simulations, synthesis of real RISC-V processor designs for a 15nm technology, and consideration of the effects of Dynamic Voltage and Frequency Scaling, we demonstrate that TransRec provides better performance-energy tradeoffs that are otherwise unachievable with traditional big.LITTLE-like designs. In particular, for less than 40% area overhead, TransRec can improve performance in the low-energy mode (LITTLE) by 2.28×, and can improve both performance and energy efficiency by 1.32× and 1.59×, respectively, in high-performance mode (big).

18:31

IP2-2, 116

CADE: CONFIGURABLE APPROXIMATE DIVIDER FOR ENERGY EFFICIENCY
Speaker:
Mohsen Imani, University of California, San Diego, US
Authors:
Mohsen Imani, Ricardo Garcia, Andrew Huang and Tajana Rosing, University of California, San Diego, US
Abstract
Approximate computing is a promising way to design faster and more energy-efficient systems that still provide adequate quality for a variety of functions. Division, in particular floating-point division, is one of the most important operations in multimedia applications, yet it has been implemented less often in hardware due to its significant cost and complexity. In this paper, we propose CADE, a Configurable Approximate Divider that performs floating-point division with runtime-controllable accuracy. CADE approximates the result by removing the costly division operation and replacing it with a subtraction of the input operands' mantissas. To increase the level of accuracy, CADE analyses the first N bits (called tuning bits) of both input operands' mantissas to estimate the division error. If CADE determines that the first approximation is unacceptable, a pre-computed value is retrieved from memory and subtracted from the mantissa of the first approximation. At runtime, CADE can provide higher accuracy by increasing the number of tuning bits. The proposed CADE was integrated into the AMD GPU architecture. Our evaluation shows that CADE is at least 4.1× more energy efficient, 1.5× faster, and 1.7× more area efficient than state-of-the-art approximate dividers while providing a 25% lower error rate. In addition, CADE gives the GPU a new knob to configure the level of approximation at runtime depending on the application/user accuracy requirement.
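The first approximation step, replacing division by a mantissa subtraction, can be sketched in software as below. This is one plausible reading based on the classic logarithmic approximation log2(1 + f) ≈ f; the paper's actual datapath may differ. The function name `approx_divide` is hypothetical, only positive finite operands are handled, and the tuning-bit correction with pre-computed values is not modelled.

```python
import math


def approx_divide(a: float, b: float) -> float:
    """Approximate a / b by subtracting exponents and mantissa fractions."""
    if a <= 0.0 or b <= 0.0:
        raise ValueError("this sketch handles positive operands only")
    # math.frexp returns (m, e) with x == m * 2**e and 0.5 <= m < 1.
    ma, ea = math.frexp(a)
    mb, eb = math.frexp(b)
    # Convert to the IEEE-style form (1 + f) * 2**E with f in [0, 1).
    fa, fb = 2.0 * ma - 1.0, 2.0 * mb - 1.0
    exp = (ea - 1) - (eb - 1)   # exponent subtraction
    diff = fa - fb              # mantissa subtraction replaces the division
    if diff < 0.0:
        # Borrow one from the exponent so the fraction stays in [0, 1).
        return math.ldexp(2.0 + diff, exp - 1)
    return math.ldexp(1.0 + diff, exp)


print(approx_divide(6.0, 2.0))  # 3.0, exact because 2.0 is a power of two
print(approx_divide(1.0, 3.0))  # 0.375, versus the true value 0.333...
```

The second call is about 12.5% above the true quotient, which shows the kind of error that a correction stage such as the tuning-bit lookup described above would need to reduce.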

18:30

End of session
Exhibition Reception in Exhibition Area

The Exhibition Reception will take place on Tuesday in the exhibition area, where free drinks for all conference delegates and exhibition visitors will be offered. All exhibitors are welcome to also provide drinks and snacks for the attendees.