Self-Adaptive Root Cause Diagnosis for Large-Scale Microservice Architecture

Meng Ma, Weilan Lin, Disheng Pan, Ping Wang

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

The emergence of microservice architecture in cloud systems poses new challenges for reliable operation and maintenance. Because of the large number of services and the diverse types of metrics, identifying the root cause of an anomaly in a large-scale microservice architecture is time-consuming and difficult. To address this issue, this article presents a multi-metric, self-adaptive root cause diagnosis framework named MS-Rank. MS-Rank decomposes the task into four phases: impact graph construction, random walk diagnosis, result precision evaluation, and metric weight update. First, we introduce the concept of implicit metrics and propose a composite impact graph construction algorithm that uses multiple types of metrics to discover causal relationships between services. Next, we propose a diagnostic algorithm in which forward, selfward, and backward transitions are designed to heuristically identify the root cause services. In addition, we establish a self-adaptive mechanism that dynamically updates the confidence of different metrics according to their diagnostic precision. Lastly, we develop a prototype system and integrate MS-Rank into a real production system, IBM Cloud. Experimental results show that MS-Rank achieves high diagnostic precision and outperforms several selected benchmarks. Through multiple rounds of diagnosis, MS-Rank can optimize itself effectively. MS-Rank can be rapidly deployed in various microservice-based systems and applications and requires no predefined knowledge. It also allows expert experience to be introduced into the framework to improve diagnostic efficiency and precision.
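The abstract names forward, selfward, and backward transitions but gives no formulas, so the following is only an illustrative sketch of that style of walk, not MS-Rank itself. It assumes an impact graph in which an edge u -> v means "u may be affected by v", and a per-service score `corr` (hypothetical values) measuring how strongly each service's metrics correlate with the observed anomaly. The transition weights, the backward damping factor `rho`, and all function and service names are assumptions made for illustration. Instead of sampling a walk, the sketch computes the walk's long-run visit frequencies by power iteration, which keeps the result deterministic.

```python
def transition_probs(graph, corr, rho=0.1):
    """Row-normalised forward/selfward/backward transition probabilities.

    All weighting rules here are assumptions, not the paper's definitions.
    """
    parents = {u: [] for u in graph}
    for u, outs in graph.items():
        for v in outs:
            parents[v].append(u)

    probs = {}
    for u in graph:
        row = {}
        # Forward: move toward a service that u depends on.
        for v in graph[u]:
            row[v] = row.get(v, 0.0) + corr[v]
        # Backward: damped step back, so the walk can leave a dead end.
        for p in parents[u]:
            row[p] = row.get(p, 0.0) + rho * corr[p]
        # Selfward: stay if u looks more anomalous than anything downstream.
        best_forward = max((corr[v] for v in graph[u]), default=0.0)
        row[u] = row.get(u, 0.0) + max(0.0, corr[u] - best_forward)

        total = sum(row.values())
        probs[u] = {v: w / total for v, w in row.items()} if total > 0 else {u: 1.0}
    return probs


def rank_services(graph, corr, rho=0.1, iterations=200):
    """Rank services by long-run visit frequency (power iteration)."""
    probs = transition_probs(graph, corr, rho)
    score = {u: 1.0 / len(graph) for u in graph}
    for _ in range(iterations):
        nxt = {u: 0.0 for u in graph}
        for u, row in probs.items():
            for v, p in row.items():
                nxt[v] += score[u] * p
        score = nxt
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical four-service system; the scores are made up for illustration.
graph = {"frontend": ["cart", "auth"], "cart": ["db"], "auth": [], "db": []}
corr = {"frontend": 0.6, "cart": 0.7, "auth": 0.1, "db": 0.95}

ranking = rank_services(graph, corr)
for service, value in ranking:
    print(f"{service}: {value:.3f}")
```

In this toy setup the walk concentrates on `db`, the most strongly correlated service with no further dependencies, so it is ranked first. A multi-metric variant in the spirit of the abstract could build `corr` as a weighted combination of per-metric scores and adjust those weights after each diagnosis round, but the update rule would again have to come from the paper rather than from this sketch.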

Original language: English
Pages (from-to): 1399-1410
Number of pages: 12
Journal: IEEE Transactions on Services Computing
Volume: 15
Issue number: 3
State: Published - 2022
Externally published: Yes

Keywords

  • Microservice architecture
  • anomaly detection
  • cloud computing
  • impact graph
  • root cause
