FaultInsight: Interpreting Hyperscale Data Center Host Faults

Tingzhu Bi, Zhang Yang, Yicheng Pan, Yu Zhang, Meng Ma, Xinrui Jiang, Linlin Han, Feng Wang, Xian Liu, Ping Wang

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Operating and maintaining hyperscale data centers involving millions of service hosts has been an extremely intricate task to tackle for top Internet companies. Incessant system failures cost operators countless hours of browsing through performance metrics to diagnose the underlying root cause to prevent the recurrence. Although many state-of-the-art (SOTA) methods have used time-series causal discovery to construct causal relationships among anomalous metrics, they only focus on homogeneous service-level performance metrics and fail to yield useful insights on heterogeneous host-level metrics. To address the challenge, this study presents FaultInsight, a highly interpretable deep causal host fault diagnosing framework that offers diagnostic insights from various perspectives to reduce human effort in troubleshooting. We evaluate FaultInsight using dozens of incidents collected from our production environment. FaultInsight provides markedly better root cause identification accuracy than SOTA baselines in our incident dataset. It also shows outstanding advantages in terms of deployability in real production systems. Our engineers are deeply impressed by FaultInsight's ability to interpret incidents from multiple perspectives, helping them quickly understand the mechanism behind the faults.

Original language: English
Title of host publication: KDD 2024 - Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Publisher: Association for Computing Machinery
Pages: 141-152
Number of pages: 12
ISBN (Electronic): 9798400704901
State: Published - 25 Aug 2024
Externally published: Yes
Event: 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024 - Barcelona, Spain
Duration: 25 Aug 2024 to 29 Aug 2024

Publication series

Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
ISSN (Print): 2154-817X

Conference

Conference: 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024
Country/Territory: Spain
City: Barcelona
Period: 25/08/24 to 29/08/24

Keywords

  • causal discovery
  • data center
  • fault diagnosis
