Performance Analysis and Auto-tuning for SPARK in-memory Analytics
Dimitra Nikitopoulou1,a, Dimosthenis Masouros1,b, Sotirios Xydis1,2,c,d and Dimitrios Soudris1,e
1Microprocessors and Digital Systems Laboratory, ECE, National Technical University of Athens, Greece
2Department of Informatics and Telematics (DIT), Harokopio University of Athens (HUA), Greece
a diminiki@microlab.ntua.gr
b dmasouros@microlab.ntua.gr
c sxydis@microlab.ntua.gr
d sxydis@hua.gr
e dsoudris@microlab.ntua.gr
ABSTRACT
Recently, the Apache Spark in-memory computing framework has gained a lot of attention due to its increased performance on large-scale data processing. Although Spark is highly configurable, tuning it manually is time-consuming because of its high-dimensional configuration space. Prior research has produced frameworks able to analyze and model the performance of Spark applications; however, they either rely on empirical selection of important parameters and/or follow a purely application-specific modeling approach. In this paper, we propose an end-to-end performance auto-tuning framework for Spark in-memory analytics. By adopting statistical hypothesis testing techniques, we extract the higher-order effects among different parameters and their significance for performance optimization. In addition, we propose a new systematic, meta-model-driven approach that uses cluster-wise, rather than application-wise, performance modeling to traverse the configuration search space. We evaluate our approach using real-scale analytic benchmarks from the HiBench suite and show that the proposed framework achieves an average performance gain of 3.07× for known and 2.01× for unknown applications, compared to the default configuration.
Keywords: In-Memory Computing, Apache Spark, Auto-Tuning, Performance Analysis, Machine Learning, Optimization.