Continuous Learning of HPC Infrastructure Models Using Big Data Analytics and In-Memory Processing Tools

Francesco Beneventi¹, Andrea Bartolini¹﹐², Carlo Cavazzoni³ and Luca Benini¹﹐²
¹Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy
francesco.beneventi@unibo.it, a.bartolini@unibo.it, luca.benini@unibo.it
²Integrated Systems Laboratory, ETH Zurich, Switzerland
barandre@iis.ee.ethz.ch, lbenini@iis.ee.ethz.ch
³Cineca, Italy
c.cavazzoni@cineca.it

ABSTRACT


Exascale computing represents the next leap in the HPC race. Reaching this level of performance is subject to several engineering challenges such as energy consumption, equipment cooling, reliability and massive parallelism. Model-based optimization is an essential tool in the design and control of energy-efficient, reliable and thermally constrained systems. However, in the Exascale domain, model-learning techniques tailored to a specific supercomputer require real measurements and must therefore handle and analyze the massive amount of data coming from the HPC monitoring infrastructure. This rapidly becomes a "big data"-scale problem. The common approach, in which measurements are first stored in large databases and then processed, is no longer affordable due to the increasing storage costs and the lack of real-time support. Instead, cloud-based machine learning techniques now aim to build on-line models using real-time approaches such as "stream processing" and "in-memory" computing, which avoid storage costs and enable fast data processing. Moreover, the fast delivery of the models and their adaptation to quick data variations make the decision stage of the optimization loop more effective and reliable. In this paper we leverage scalable, lightweight and flexible IoT technologies, such as the MQTT protocol, to build a highly scalable HPC monitoring infrastructure able to handle the massive sensor data produced by next-generation HPC components. We then show how state-of-the-art tools for big data computing and analysis, such as Apache Spark, can be used to manage the huge amount of data delivered by the monitoring layer and to build adaptive models in real time using on-line machine learning techniques.
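As a concrete illustration of the MQTT-based monitoring layer the abstract describes, the sketch below shows a minimal per-node agent publishing sensor samples to a broker with the paho-mqtt client. The broker address, topic layout, node identifier and the read_sensors() helper are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch of a per-node monitoring agent publishing sensor samples
# over MQTT (paho-mqtt). Broker address, topic layout and read_sensors()
# are assumed for illustration only.
import json
import time
import paho.mqtt.client as mqtt

BROKER_HOST = "broker.example.org"   # assumed broker address
NODE_ID = "node042"                  # assumed node identifier

def read_sensors():
    """Placeholder: return a dict of per-node measurements."""
    return {"cpu_temp_c": 54.0, "pkg_power_w": 87.5, "load": 0.63}

client = mqtt.Client()
client.connect(BROKER_HOST, 1883)
client.loop_start()                  # network loop in a background thread

while True:
    sample = read_sensors()
    sample["ts"] = time.time()
    # One topic per node and metric keeps broker-side subscriptions fine-grained.
    for metric, value in sample.items():
        client.publish("hpc/{}/{}".format(NODE_ID, metric), json.dumps(value))
    time.sleep(1.0)                  # sampling period
```

A collector or analytics front-end would subscribe to the corresponding topic tree (e.g. hpc/#) to receive the full sensor stream.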

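Likewise, the following sketch illustrates the kind of on-line model learning the abstract attributes to Apache Spark, here using Spark Streaming with MLlib's StreamingLinearRegressionWithSGD. The socket text source and the record format are assumptions standing in for the MQTT-fed stream; they are not the paper's implementation.

```python
# Minimal sketch of on-line model learning over the monitoring stream with
# Spark Streaming and MLlib. The text socket source and CSV record format
# are assumptions; in the paper's setting the stream would come from the
# MQTT monitoring layer.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

sc = SparkContext(appName="hpc-online-model")
ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches

# Assumed record format: "target,feat1,feat2,feat3" (one sample per line).
def parse(line):
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])

samples = ssc.socketTextStream("localhost", 9999).map(parse)

model = StreamingLinearRegressionWithSGD(stepSize=0.01, numIterations=20)
model.setInitialWeights([0.0, 0.0, 0.0])      # must match the feature count

# The model is updated on every incoming micro-batch, so it tracks
# variations in the monitored system without storing the raw data.
model.trainOn(samples)
model.predictOnValues(samples.map(lambda lp: (lp.label, lp.features))).pprint()

ssc.start()
ssc.awaitTermination()
```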

