HEAL: Performance Troubleshooting Deep inside Data Center Hosts

Yicheng Pan, Yang Zhang, Tingzhu Bi, Linlin Han, Yu Zhang, Meng Ma, Xiangzhuang Shen, Xinrui Jiang, Feng Wang, Xian Liu, Ping Wang

Research output: Contribution to journalArticlepeer-review

Abstract

This study demonstrates the salient facts and challenges of host failure operations in hyperscale data centers. A host incident can involve hundreds of distinct host-level metrics. The faulting mechanism inside the host connects these heterogeneous metrics through direct and indirect correlation, making it extremely difficult to sort out the propagation procedures and the root cause from these intertwined indicators. To deeply understand the failure mechanism inside the host, we develop HEAL-a novel host metrics analysis toolkit. HEAL discovers dynamic causality in sparse heterogeneous host metrics by combining the strengths of both time series and random variable analysis. It also extracts causal directional hints from causality's asymmetry and historical knowledge, which finally help HEAL produce accurate results given undesirable inputs. Evaluations in our production environment verify that HEAL provides significantly better result accuracy and full-process interpretability than the SOTA baselines. With these advantages, HEAL successfully serves our data center and worldwide product operations and impressively contributes to many other workflows.

Original languageEnglish
Pages (from-to)41-42
Number of pages2
JournalPerformance Evaluation Review
Volume52
Issue number1
DOIs
StatePublished - 10 Jun 2024
Externally publishedYes

Keywords

  • dynamic causality
  • granger causality analysis
  • host machines
  • monitoring metrics
  • root cause analysis

Fingerprint

Dive into the research topics of 'HEAL: Performance Troubleshooting Deep inside Data Center Hosts'. Together they form a unique fingerprint.

Cite this