Prediction of Thermal Hazards in a Real Datacenter Room using Temporal Convolutional Networks

Mohsen Seyedkazemi Ardebili [a], Marcello Zanghieri [b], Alessio Burrello [c], Francesco Beneventi [d], Andrea Acquaviva [e], Luca Benini [1,2,a,b] and Andrea Bartolini [f]
[1] Department of Electrical, Electronic and Information Engineering (DEI) “Guglielmo Marconi”, Università degli Studi di Bologna, Bologna, Italy
[a] mohsen.seyedkazemi@unibo.it
[b] marcello.zanghieri2@unibo.it
[c] alessio.burrello@unibo.it
[d] francesco.beneventi@unibo.it
[e] andrea.acquaviva@unibo.it
[f] a.bartolini@unibo.it
[2] Integrated Systems Laboratory, ETH Zurich, Switzerland
[a] lbenini@iis.ee.ethz.ch
[b] luca.benini@unibo.it

ABSTRACT


Datacenters play a vital role in today’s society. In general, a datacenter room is a complex controlled environment composed of thousands of computing nodes, which consume kW of power. To dissipate this power, forced air/liquid flow is employed, at a cost of millions of euros per year. Reducing this cost involves using free cooling and average-case design, which can create a cooling shortage and thermal hazards. When a thermal hazard happens, the system administrators and the facility manager must stop production to avoid IT equipment damage and wear-out. In this paper, we study the signatures of thermal hazards in the monitored data of a Tier-0 datacenter room during a full year of production. We define a set of rules for detecting thermal hazards based on the inlet and outlet temperatures of all nodes in a room. We then propose a custom Temporal Convolutional Network (TCN) to predict the hazards in advance. The results show that our TCN can predict thermal hazards with an F1-score of 0.98 on a randomly sampled test set. When causality is enforced between the training and validation sets, the F1-score drops to 0.74, which calls for in-place online retraining of the network and motivates further research in this context.

Keywords: HPC, Thermal Hazard, Predictive Model, Thermal Anomaly Detection, Temporal Convolutional Network.
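
The two evaluation protocols contrasted in the abstract (a randomly sampled test set versus a split that enforces causality) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `f1_score`, `random_split`, and `causal_split`, the 20% test fraction, and the toy sample count are assumptions made for illustration only.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 = 2*TP / (2*TP + FP + FN), with label 1 meaning 'hazard'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def random_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle time indices before splitting.

    Test samples end up interleaved in time with training samples,
    so the model is evaluated on periods it has effectively seen.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = int(n_samples * test_fraction)
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def causal_split(n_samples, test_fraction=0.2):
    """Keep chronological order.

    Every held-out sample is strictly later than every training sample,
    so the model is evaluated only on the future.
    """
    n_test = int(n_samples * test_fraction)
    cut = n_samples - n_test
    return list(range(cut)), list(range(cut, n_samples))


# Toy example with 10 time-ordered samples (assumed size).
train_r, test_r = random_split(10)
train_c, test_c = causal_split(10)
```

With the causal split, the held-out indices all lie after the training indices, which is the stricter setting in which the abstract reports the lower F1-score.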


