4.2 Domain Specific Design Methodologies


Date: Tuesday 20 March 2018
Time: 17:00 - 18:30
Location / Room: Konf. 6

Chair:
Frédéric Pétrot, Grenoble Institute of Technology, FR

Co-Chair:
Lars Bauer, Karlsruhe Institute of Technology, DE

In the quest for high efficiency, design methodologies specialize to particular domains. First, a case study of approximate computing in the field of biometric security is presented. The second talk proposes a framework that uses a genetic algorithm to find an optimal mapping of artificial neural networks onto GPU plus multi-core systems. Finally, a method is presented that uses software control to extend the tolerance of read-intensive applications in NAND flash to disturbances provoked by neighboring reads.

Time  Label  Presentation Title / Authors
17:00  4.2.1  APPROXIMATE COMPUTING FOR BIOMETRIC SECURITY SYSTEMS: A CASE STUDY ON IRIS SCANNING
Speaker:
Sherief Reda, Brown University, US
Authors:
Soheil Hashemi, Hokchhay Tann, Francesco Buttafuoco and Sherief Reda, Brown University, US
Abstract
Exploiting the error resilience of emerging data-rich applications, approximate computing introduces a small amount of inaccuracy into computing systems to achieve significant reductions in computing resources such as power, design area, runtime, or energy. Successful applications of approximate computing have been demonstrated in machine learning, image processing, and computer vision. In this paper, we make the case for a new direction for approximate computing in the field of biometric security, with a comprehensive case study of iris scanning. We devise an end-to-end flow, from an input camera to the final iris encoding, that produces sufficiently accurate final results despite relying on intermediate approximate computational steps. Unlike previous methods, which evaluated approximate computing techniques on individual algorithms, our flow consists of a complex SW/HW pipeline of four major algorithms that compute the iris encoding from live camera feeds. In our flow, we identify eight approximation knobs at both the algorithmic and hardware levels to trade off accuracy against runtime. To identify the optimal values for these knobs, we devise a novel design space exploration technique based on reinforcement learning with a recurrent neural network agent. Finally, we fully implement and test our proposed methodologies using both benchmark dataset images and live camera images on an FPGA-based SoC. We show that we are able to reduce the runtime of the system by 48x on top of an already hardware-accelerated design, while meeting industry-standard accuracy requirements for iris scanning systems.
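The knob-tuning idea in the abstract above can be sketched in miniature. The snippet below uses a toy hill-climbing search, a deliberately simpler stand-in for the paper's reinforcement-learning agent with an RNN; the knob names, value ranges, and the accuracy/runtime model are all illustrative assumptions, not details from the paper.

```python
import random

# Hypothetical approximation knobs (illustrative, not from the paper):
# each knob trades accuracy against runtime.
KNOBS = {
    "filter_taps": [4, 8, 16, 32],
    "fixed_point_bits": [6, 8, 10, 12],
    "pyramid_levels": [2, 3, 4],
}

def evaluate(cfg):
    """Toy surrogate: higher precision -> higher accuracy but longer runtime."""
    accuracy = sum(opts.index(cfg[k]) for k, opts in KNOBS.items())
    runtime = sum(cfg[k] for k in cfg)
    return accuracy, runtime

def hill_climb(min_accuracy, iters=200, seed=0):
    """Find a fast knob configuration that still meets an accuracy floor."""
    rng = random.Random(seed)
    best = {k: opts[-1] for k, opts in KNOBS.items()}  # most precise setting
    for _ in range(iters):
        cand = dict(best)
        knob = rng.choice(list(KNOBS))
        cand[knob] = rng.choice(KNOBS[knob])           # perturb one knob
        acc, rt = evaluate(cand)
        if acc >= min_accuracy and rt < evaluate(best)[1]:
            best = cand                                # faster, still accurate
    return best

best = hill_climb(min_accuracy=5)
print(best, evaluate(best))
```

The paper's actual contribution is replacing this kind of local search with a learned agent that explores the eight-knob space far more effectively.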

Download Paper (PDF; Only available from the DATE venue WiFi)
17:30  4.2.2  FLASH READ DISTURB MANAGEMENT USING ADAPTIVE CELL BIT-DENSITY WITH IN-PLACE REPROGRAMMING
Speaker:
Sung-Ming Wu, National Chiao-Tung University, TW
Authors:
Tai-Chou Wu, Yu-Ping Ma and Li-Pin Chang, National Chiao-Tung University, TW
Abstract
Read disturbance is circuit-level noise induced by flash read operations. Read refreshing employs data migration to prevent read disturbance from corrupting useful data, but it incurs frequent block erasures under read-intensive workloads. Inspired by software-controlled cell bit-density, we propose reserving selected threshold-voltage levels as guard levels to extend the tolerance to read disturbance. Blocks with guard levels have a low cell bit-density, but they can store frequently read data without frequent read refreshing. We further propose converting a high-density block into a low-density one using in-place reprogramming to reduce the need for data migration. Our approach reduced the number of blocks erased due to read refreshing by up to 85% and the average read response time by up to 22%.
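To make the guard-level idea concrete, here is a tiny numerical sketch. The level counts follow standard flash terminology (a TLC cell distinguishes 8 threshold-voltage levels), but the read-tolerance model below is an illustrative assumption, not a figure from the paper.

```python
import math

TLC_LEVELS = 8  # a TLC cell distinguishes 8 threshold-voltage levels (3 bits)

def bit_density(guard_levels):
    """Whole bits per cell when the top `guard_levels` levels are reserved."""
    usable = TLC_LEVELS - guard_levels
    return int(math.log2(usable))

def reads_before_refresh(guard_levels, reads_per_level=10_000):
    """Toy model (assumed numbers): each guard level absorbs a fixed budget
    of extra reads before read refreshing (data migration) is needed."""
    return (1 + guard_levels) * reads_per_level

# Reserving 4 of 8 levels turns a 3-bit (TLC-like) block into a 2-bit
# (MLC-like) block that tolerates many more reads between refreshes.
for g in (0, 4, 6):
    print(g, bit_density(g), reads_before_refresh(g))
```

This captures the trade-off the abstract describes: guard levels cost capacity (lower bit-density) but buy tolerance, so frequently read data can live in low-density blocks without constant refreshing.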

18:00  4.2.3  HTF-MPR: A HETEROGENEOUS TENSORFLOW MAPPER TARGETING PERFORMANCE USING GENETIC ALGORITHMS AND GRADIENT BOOSTING REGRESSORS
Speaker:
Nader Bagherzadeh, University of California, Irvine, US
Authors:
Ahmad Albaqsami, Maryam S. Hosseini and Nader Bagherzadeh, University of California, Irvine, US
Abstract
TensorFlow is a library developed by Google to implement artificial neural networks using computational dataflow graphs. A neural network undergoes many iterations during training, so a distributed, parallel environment is ideal for speeding up learning. Parallelism requires a proper mapping of devices to TensorFlow operations, and we developed the HTF-MPR framework for this purpose. HTF-MPR uses a genetic algorithm to search for mappings that outperform the default TensorFlow mapper. By using gradient boosting regressors to build the fitness prediction model, the search space is expanded, which increases the chances of finding a solution mapping. Our results on well-known neural network benchmarks, such as AlexNet, the MNIST softmax classifier, and VGG-16, show overall training-stage speedups of 1.18x, 3.33x, and 1.13x, respectively.
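The general shape of such a search can be sketched with a toy genetic algorithm. Everything below is an illustrative assumption: the operation list, the per-device cost table, and the direct cost evaluation (which stands in for the paper's gradient-boosting fitness predictor).

```python
import random

# Hypothetical dataflow ops and devices; costs are made-up runtimes.
OPS = ["conv1", "conv2", "matmul", "softmax"]
DEVICES = ["cpu", "gpu"]
COST = {
    "conv1":   {"cpu": 9.0, "gpu": 2.0},
    "conv2":   {"cpu": 8.0, "gpu": 2.5},
    "matmul":  {"cpu": 4.0, "gpu": 1.5},
    "softmax": {"cpu": 0.5, "gpu": 1.0},
}

def fitness(mapping):
    """Lower total runtime -> fitter individual (hence the negation)."""
    return -sum(COST[op][dev] for op, dev in zip(OPS, mapping))

def evolve(pop_size=30, generations=40, seed=1):
    rng = random.Random(seed)
    pop = [[rng.choice(DEVICES) for _ in OPS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]            # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, len(OPS))        # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.25:                 # point mutation
                i = rng.randrange(len(OPS))
                child[i] = rng.choice(DEVICES)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(dict(zip(OPS, best)), -fitness(best))
```

A GA pays off when the op graph has thousands of nodes and evaluating a mapping is expensive; that is where a learned surrogate fitness model, as in the paper, replaces direct measurement.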

18:30  IP1-10, 363  AN EFFICIENT RESOURCE-OPTIMIZED LEARNING PREFETCHER FOR SOLID STATE DRIVES
Speaker:
Rui Xu, University of Science and Technology of China, CN
Authors:
Rui Xu, Xi Jin, Linfeng Tao, Shuaizhi Guo, Zikun Xiang and Teng Tian, Strongly-Coupled Quantum Matter Physics, Chinese Academy of Sciences, School of Physical Sciences, University of Science and Technology of China, Hefei, Anhui, CN
Abstract
In recent years, solid-state drives (SSDs) have been widely deployed in modern storage systems. To increase the performance of SSDs, prefetchers have been designed both at the operating system (OS) layer and in the flash translation layer (FTL). Prefetchers in the FTL have many advantages, such as OS independence, ease of use, and compatibility. However, due to limited computing capability and memory resources, existing FTL prefetchers merely employ simple sequential prefetching, which may incur a high penalty cost for I/O access streams with complex patterns. In this paper, an efficient learning prefetcher implemented in the FTL is proposed. Considering the resource limitations of SSDs, a learning algorithm based on Markov chains is employed and optimized so that a high hit ratio and low penalty cost can be achieved even for complex access patterns. To validate our design, a simulator including the prefetcher was implemented based on FlashSim. The TPC-H benchmark and an application launch trace were run on the simulator. According to experimental results on the TPC-H benchmark, more than 90% of the memory cost can be saved in comparison with a previous OS-layer design. The hit ratio can be increased by 24.1% and the number of mis-prefetches reduced by 95.8% in comparison with the simple sequential prefetching strategy.
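The core idea of a Markov-chain prefetcher can be shown in a few lines. This is a generic first-order sketch, not the paper's resource-optimized FTL implementation: it learns transition counts between logical block addresses and prefetches the most likely successor of the block just accessed.

```python
from collections import defaultdict

class MarkovPrefetcher:
    """First-order Markov-chain prefetcher over logical block addresses."""

    def __init__(self):
        # counts[a][b] = how often block b was accessed right after block a
        self.counts = defaultdict(lambda: defaultdict(int))
        self.prev = None

    def access(self, lba):
        """Record an access and return the block to prefetch, if any."""
        if self.prev is not None:
            self.counts[self.prev][lba] += 1
        self.prev = lba
        successors = self.counts[lba]
        if not successors:
            return None  # no history for this block yet
        return max(successors, key=successors.get)

pf = MarkovPrefetcher()
trace = [1, 2, 7, 1, 2, 7, 1, 2]
predictions = [pf.access(lba) for lba in trace]
print(predictions)  # → [None, None, None, 2, 7, 1, 2, 7]
```

Unlike sequential prefetching, this predicts the non-contiguous jump 2 → 7 correctly after one repetition; the paper's contribution is making such a table cheap enough in memory and computation to live inside the FTL.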

18:30  End of session
Exhibition Reception in Exhibition Area
The Exhibition Reception will take place on Tuesday in the exhibition area, where free drinks will be offered to all conference delegates and exhibition visitors. All exhibitors are also welcome to provide drinks and snacks for the attendees.