5.7 Software-centric techniques for embedded systems


Date: Wednesday 21 March 2018
Time: 08:30 - 10:00
Location / Room: Konf. 5

Chair:
Marc Geilen, Eindhoven University of Technology, NL

Co-Chair:
Daniel Ziener, University of Twente, NL

Modern heterogeneous architectures pose new challenges for energy-efficient embedded realizations. The talks in this session address these challenges using software techniques, such as approximate computing, task scheduling, and memory and power-management.

Time | Label | Presentation Title / Authors

08:30 | 5.7.1 | HEPREM: ENABLING PREDICTABLE GPU EXECUTION ON HETEROGENEOUS SOC
Speaker:
Björn Forsberg, ETH Zürich, CH
Authors:
Björn Forsberg1, Luca Benini2 and Andrea Marongiu3
1ETH Zürich, CH; 2Università di Bologna, IT; 3IIS, ETH Zurich, CH
Abstract
Heterogeneous systems-on-a-chip are increasingly embracing shared memory designs, in which a single DRAM is used for both the main CPU and an integrated GPU. This architectural paradigm reduces the overheads associated with data movements and simplifies programmability. However, the deployment of real-time workloads on such architectures is troublesome, as memory contention significantly increases the execution time of tasks and the pessimism in worst-case execution time (WCET) estimates. The Predictable Execution Model (PREM) separates memory and computation phases in real-time codes, then arbitrates memory phases from different tasks such that only one core at a time can access the DRAM. This paper revisits the original PREM proposal in the context of heterogeneous SoCs, proposing a compiler-based approach to make GPU codes PREM-compliant. Starting from high-level specifications of computation offloading, suitable program regions are selected and separated into memory and compute phases. Our experimental results show that the proposed technique reduces the sensitivity of GPU kernels to memory interference to near zero, and achieves up to a 20× reduction in the measured WCET.

Download Paper (PDF; Only available from the DATE venue WiFi)
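The core PREM mechanism the abstract describes can be illustrated with a small sketch: each task first copies its working set to private storage under an arbiter that serializes DRAM access, then computes on local data only. This is a minimal, hypothetical illustration of the phase-separation idea, not the paper's compiler pass; all names and numbers are invented.

```python
import threading

# A single lock models the PREM arbiter that grants DRAM access to one
# core at a time; tasks never contend during their compute phase.
dram_arbiter = threading.Lock()

def prem_task(shared_dram, lo, hi, results, idx):
    # Memory phase: serialized by the arbiter.
    with dram_arbiter:
        local = shared_dram[lo:hi]          # prefetch into "scratchpad"
    # Compute phase: touches only the local copy, no DRAM traffic.
    results[idx] = sum(x * x for x in local)

shared_dram = list(range(8))
results = [0, 0]
threads = [
    threading.Thread(target=prem_task, args=(shared_dram, 0, 4, results, 0)),
    threading.Thread(target=prem_task, args=(shared_dram, 4, 8, results, 1)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # sums of squares of each half of the shared buffer
```

Because each compute phase reads only its private copy, the worst-case timing of a task no longer depends on what the other cores are doing to DRAM, which is the property the WCET reduction rests on.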
09:00 | 5.7.2 | CIRCUIT CARVING: A METHODOLOGY FOR THE DESIGN OF APPROXIMATE HARDWARE
Speaker:
Ilaria Scarabottolo, USI Lugano, CH
Authors:
Ilaria Scarabottolo, Giovanni Ansaloni and Laura Pozzi, USI Lugano, CH
Abstract
Systems-on-Chip (SoCs) commonly couple low-power processors with dedicated hardware accelerators, which allow the execution of high-workload and/or timing-critical applications while relying on constrained resources. The functions performed by accelerators are often robust to approximations that, when implemented in hardware, lead to circuits with tangibly lower area and power consumption. Research in approximate computing aims at developing effective strategies to explore the ensuing correctness/efficiency trade-off. In this context, we address the challenge of approximate circuit design in an innovative way, called here Circuit Carving, which consists in identifying the maximum portion of an exact circuit that can be discarded, or carved out, to derive an inexact version that does not exceed an error threshold. We achieve this goal with an algorithm based on binary tree exploration, bounded by conditions extracted from the circuit topology. Our approach can be applied to any combinational circuit, without a-priori knowledge of its functionality. The proposed algorithm allows back-tracking so that it is never trapped in local minima, and it identifies the exact influence of each circuit gate on output correctness, resulting in inexact circuits with higher efficiency and accuracy than those produced by state-of-the-art greedy strategies.

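The carving objective can be made concrete on a toy circuit: find the largest set of gates that can be replaced by a constant while the error rate over all inputs stays under a threshold. The sketch below brute-forces that search on a made-up three-gate circuit; the paper's bounded binary-tree exploration with back-tracking is far more scalable, and the gate names here are invented.

```python
from itertools import product, combinations

GATES = {            # name: (op, input_a, input_b); a, b are primary inputs
    "g1": ("and", "a", "b"),
    "g2": ("xor", "a", "b"),
    "g3": ("or", "g1", "g2"),
}
OUTPUT = "g3"

def evaluate(a, b, carved=frozenset()):
    """Evaluate the circuit, with carved gates tied to constant 0."""
    vals = {"a": a, "b": b}
    for name, (op, x, y) in GATES.items():
        if name in carved:
            vals[name] = 0
        else:
            u, v = vals[x], vals[y]
            vals[name] = {"and": u & v, "or": u | v, "xor": u ^ v}[op]
    return vals[OUTPUT]

def carve(max_error_rate):
    """Largest gate set removable within the error budget (brute force)."""
    inputs = list(product([0, 1], repeat=2))
    for k in range(len(GATES), 0, -1):                # largest sets first
        for subset in map(frozenset, combinations(GATES, k)):
            errs = sum(evaluate(a, b) != evaluate(a, b, subset)
                       for a, b in inputs)
            if errs / len(inputs) <= max_error_rate:
                return subset
    return frozenset()

print(sorted(carve(0.25)))   # gates removable within a 25% error rate
```

For this circuit, carving g1 turns a|b into a^b, which is wrong only for input (1,1), i.e. a 25% error rate; with a 0% budget nothing can be carved. The exhaustive subset walk is exactly what the paper's topology-derived bounds prune.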
09:15 | 5.7.3 | ICNN: AN ITERATIVE IMPLEMENTATION OF CONVOLUTIONAL NEURAL NETWORKS TO ENABLE ENERGY AND COMPUTATIONAL COMPLEXITY AWARE DYNAMIC APPROXIMATION
Speaker:
Avesta Sasan, George Mason University, US
Authors:
Katayoun Neshatpour, Farnaz Behnia, Houman Homayoun and Avesta Sasan, George Mason University, US
Abstract
With Convolutional Neural Networks (CNN) becoming more of a commodity in the computer vision field, many have attempted to improve CNN accuracy, to the point that CNN accuracies have surpassed human capabilities. However, with deeper networks, the number of computations and consequently the power needed per classification has grown considerably. In this paper, we propose Iterative CNN (ICNN), which reformulates the CNN from a single feed-forward network into a series of sequentially executed smaller networks. Each smaller network processes a small set of sub-sampled input features and enhances the accuracy of the classification. Upon reaching an acceptable classification confidence, further processing by the remaining smaller networks is skipped. The proposed network architecture allows the CNN function to be dynamically approximated by creating the possibility of early termination, performing the classification with far fewer operations than a conventional CNN. Initial results show that this iterative approach competes with the original larger networks in terms of accuracy while incurring far less computational complexity, by classifying many images in early iterations.

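The early-termination control flow described above is straightforward to sketch: run small models in sequence, each refining a confidence score, and stop as soon as confidence clears a threshold, skipping the remaining (and increasingly expensive) stages. The stage functions, costs, and threshold below are invented stand-ins, not the paper's networks.

```python
def icnn_classify(stages, x, threshold=0.9):
    """stages: list of (cost, callable x -> (label, confidence))."""
    ops = 0
    label, conf = None, 0.0
    for cost, stage in stages:
        ops += cost                       # track work actually performed
        label, conf = stage(x)
        if conf >= threshold:             # confident enough: stop early
            break
    return label, conf, ops

# Toy stand-ins: each "network" is costlier and more confident than the last.
stages = [
    (10,  lambda x: ("cat", 0.6)),        # coarse, sub-sampled input
    (40,  lambda x: ("cat", 0.95)),       # refined by the second stage
    (200, lambda x: ("cat", 0.99)),       # full network, often skipped
]
print(icnn_classify(stages, x=None))      # stops after stage 2: 50 ops, not 250
```

Easy inputs exit after the cheap stages, so the average operation count drops well below that of always running the full network, which is the energy argument the abstract makes.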
09:30 | 5.7.4 | TASK SCHEDULING FOR MANY-CORES WITH S-NUCA CACHES
Speaker:
Anuj Pathania, Karlsruhe Institute of Technology, DE
Authors:
Anuj Pathania and Joerg Henkel, Karlsruhe Institute of Technology, DE
Abstract
A many-core processor may comprise a large number of processing cores on a single chip. The many-core's last-level shared cache can be physically distributed alongside the cores in the form of cache banks connected through a Network-on-Chip (NoC). The Static Non-Uniform Cache Access (S-NUCA) memory address mapping policy provides a scalable mechanism that gives the cores quick access to the entire last-level cache. By design, S-NUCA introduces a unique topology-based performance heterogeneity, and we introduce a scheduler that can exploit it. The proposed scheduler improves performance of the many-core by 9.93% in comparison to a state-of-the-art generic many-core scheduler, with minimal run-time overheads.

09:45 | 5.7.5 | KVSSD: CLOSE INTEGRATION OF LSM TREES AND FLASH TRANSLATION LAYER FOR WRITE-EFFICIENT KV STORE
Speaker:
Sung-Ming Wu, National Chiao-Tung University, TW
Authors:
Sung-Ming Wu, Kai-Hsiang Lin and Li-Pin Chang, National Chiao-Tung University, TW
Abstract
Log-Structured-Merge (LSM) trees are a write-optimized data structure for lightweight, high-performance Key-Value (KV) stores. Solid State Disks (SSDs) provide acceleration of KV operations on LSM trees. However, this hierarchical design involves multiple software layers, including the LSM tree, the host file system, and the Flash Translation Layer (FTL), causing cascading write amplification. We propose KVSSD, a close integration of LSM trees and the FTL, to manage write amplification from the different layers. KVSSD exploits the FTL mapping mechanism to implement copy-free compaction of LSM trees, and it enables direct data allocation in flash memory for efficient garbage collection. In our experiments, compared to the hierarchical design, our KVSSD reduced write amplification by 88% and improved throughput by 347%.

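The copy-free compaction idea can be illustrated with a toy FTL: when LSM-tree compaction merges sorted runs, a conventional design rewrites every KV page, while an FTL-integrated design can often just re-point logical pages at the physical flash pages that already hold the data. The structures below are invented stand-ins, not the paper's actual FTL.

```python
class ToyFTL:
    def __init__(self):
        self.l2p = {}            # logical page -> physical page
        self.writes = 0          # physical page programs (write-amp proxy)

    def write(self, lpn, data, flash):
        ppn = len(flash)         # out-of-place append, like real flash
        flash.append(data)
        self.l2p[lpn] = ppn
        self.writes += 1

    def remap(self, lpn, ppn):
        self.l2p[lpn] = ppn      # copy-free: no flash program occurs

ftl, flash = ToyFTL(), []
# Two sorted runs land on flash via normal writes.
for i, kv in enumerate(["a=1", "c=3", "b=2", "d=4"]):
    ftl.write(i, kv, flash)
# "Compaction" merges the runs into logical pages 10..13 by remapping only.
merged_order = [0, 2, 1, 3]      # physical pages in merged key order
for out_lpn, ppn in zip(range(10, 14), merged_order):
    ftl.remap(out_lpn, ppn)
print(ftl.writes)                # still 4: compaction added no flash writes
```

In the hierarchical design each compaction would re-program every merged page through the file system and FTL; collapsing the layers lets the merge become pure mapping-table updates, which is where the claimed write-amplification savings come from.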
10:00 | IP2-13, 231 | STREAMFTL: STREAM-LEVEL ADDRESS TRANSLATION SCHEME FOR MEMORY CONSTRAINED FLASH STORAGE (Best Paper Award Candidate)
Speaker:
Dongkun Shin, Sungkyunkwan University, KR
Authors:
Hyukjoong Kim, Kyuhwa Han and Dongkun Shin, Sungkyunkwan University, KR
Abstract
Although much research effort has been devoted to reducing the size of the address mapping table, which consumes DRAM space in solid state drives (SSDs), most SSDs still use page-level mapping for high performance in their firmware, called the flash translation layer (FTL). In this paper, we propose a novel FTL scheme called StreamFTL. To reduce the size of the mapping table in SSDs, StreamFTL maintains a mapping entry for each stream, which consists of several logical pages written to contiguous physical pages. Unlike an extent, which is used by previous FTL schemes, the logical pages in a stream need not be contiguous. We show that StreamFTL can reduce the size of the mapping table by up to 90% compared to a page-level mapping scheme.

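A rough sketch of the stream-mapping idea: logical pages written back-to-back land on contiguous physical pages, so one entry (a start physical page plus the ordered list of logical pages) resolves all of them, and unlike an extent the logical pages need not be contiguous. The class and method names below are invented for illustration.

```python
class StreamFTL:
    def __init__(self):
        self.streams = []        # one entry per stream: (start_ppn, [lpns])
        self.next_ppn = 0

    def write_stream(self, lpns):
        """Pages in one write stream occupy contiguous physical pages."""
        self.streams.append((self.next_ppn, list(lpns)))
        self.next_ppn += len(lpns)

    def lookup(self, lpn):
        # The newest stream wins, mirroring out-of-place flash updates.
        for start, lpns in reversed(self.streams):
            if lpn in lpns:
                return start + lpns.index(lpn)
        raise KeyError(lpn)

ftl = StreamFTL()
ftl.write_stream([7, 3, 42])     # non-contiguous logical pages, one entry
ftl.write_stream([3, 8])         # later update of logical page 3
print(ftl.lookup(42), ftl.lookup(3))
```

Here two entries cover five page writes, whereas page-level mapping would need five entries; longer streams make the ratio, and hence the DRAM saving, far larger.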
10:01 | IP2-14, 512 | ONLINE CONCURRENT WORKLOAD CLASSIFICATION FOR MULTI-CORE ENERGY MANAGEMENT
Speaker:
Karunakar Reddy Basireddy, University of Southampton, GB
Authors:
Karunakar Reddy Basireddy1, Amit Kumar Singh2, Geoff V. Merrett1 and Bashir M. Al-Hashimi1
1University of Southampton, GB; 2University of Essex, GB
Abstract
Modern embedded multi-core processors are organized as clusters of cores, where all cores in each cluster operate at a common voltage-frequency (V-f) point. Such processors often need to execute applications concurrently, exhibiting varying and mixed workloads (e.g. compute- and memory-intensive) depending on the instruction mix and resource sharing. Runtime adaptation is key to achieving energy savings without trading off application performance under such workload variability. In this paper, we propose an online energy management technique that performs concurrent workload classification using the metric Memory Reads Per Instruction (MRPI) and pro-actively selects an appropriate V-f setting through workload prediction. Subsequently, it monitors the workload prediction error and the performance loss, quantified by Instructions Per Second (IPS), at runtime, and adjusts the chosen V-f to compensate. We validate the proposed technique on an Odroid-XU3 with various combinations of benchmark applications. Results show an improvement in energy efficiency of up to 69% compared to existing approaches.

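The control loop outlined above can be sketched in a few lines: classify the current workload by MRPI, pick a V-f point accordingly, then nudge the choice upward if measured IPS falls short of the target. The thresholds, frequency table, and sensor values below are all invented, not the paper's calibrated parameters.

```python
FREQS = [600, 1000, 1400, 1800]          # MHz steps of one cluster (toy)

def classify(mrpi):
    # High MRPI -> memory-bound: the core mostly waits on DRAM, so a
    # lower frequency saves energy at little performance cost.
    return "memory" if mrpi > 0.02 else "compute"

def select_freq(mrpi, ips, ips_target, freq):
    idx = FREQS.index(freq)
    if classify(mrpi) == "memory":
        idx = max(idx - 1, 0)            # scale down pro-actively
    else:
        idx = len(FREQS) - 1             # compute-bound: run fast
    if ips < ips_target:                 # compensate measured IPS loss
        idx = min(idx + 1, len(FREQS) - 1)
    return FREQS[idx]

# Memory-bound phase meeting its IPS target: step down one level.
print(select_freq(mrpi=0.05, ips=1.2e9, ips_target=1.0e9, freq=1400))
# Compute-bound phase: jump to the top frequency.
print(select_freq(mrpi=0.005, ips=2.0e9, ips_target=1.0e9, freq=600))
```

The feedback term is what distinguishes this scheme from purely predictive DVFS: a misclassified phase that hurts IPS gets its frequency restored on the next interval instead of accumulating performance loss.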
10:00 | End of session
Coffee Break in Exhibition Area



Coffee Breaks in the Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks at the below-mentioned times in the exhibition area (Terrace Level of the ICCD).

Lunch Breaks (Großer Saal + Saal 1)

On all conference days (Tuesday to Thursday), a seated lunch (lunch buffet) will be offered in the rooms "Großer Saal" and "Saal 1" (Saal Level of the ICCD) to fully registered conference delegates only. There will be badge control at the entrance to the lunch break area.

Tuesday, March 20, 2018

  • Coffee Break 10:30 - 11:30
  • Lunch Break 13:00 - 14:30
  • Awards Presentation and Keynote Lecture in "Saal 2" 13:50 - 14:20
  • Coffee Break 16:00 - 17:00

Wednesday, March 21, 2018

  • Coffee Break 10:00 - 11:00
  • Lunch Break 12:30 - 14:30
  • Awards Presentation and Keynote Lecture in "Saal 2" 13:30 - 14:20
  • Coffee Break 16:00 - 17:00

Thursday, March 22, 2018

  • Coffee Break 10:00 - 11:00
  • Lunch Break 12:30 - 14:00
  • Keynote Lecture in "Saal 2" 13:20 - 13:50
  • Coffee Break 15:30 - 16:00