Faster, deeper, easier: Crowdsourcing diagnosis of microservice kernel failure from user space

Yicheng Pan, Meng Ma, Xinrui Jiang, Ping Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

19 Scopus citations

Abstract

With the widespread use of cloud-native architecture, an increasing number of web applications (apps) are built on microservices. At the same time, troubleshooting has become challenging owing to the highly dynamic and complex nature of anomaly propagation. Existing diagnostic methods rely heavily on monitoring metrics collected from the kernel side of microservice systems. Without a comprehensive monitoring infrastructure, application owners and even cloud operators cannot resort to these kernel-space solutions. This paper summarizes several insights from operating a top commercial cloud platform. Then, for the first time, we put forward the idea of user-space diagnosis for microservice kernel failures. To this end, we develop a crowdsourcing solution, DyCause, to resolve the asymmetric diagnostic information problem. DyCause is deployed on the application side in a distributed manner. Through lightweight API log sharing, apps collaboratively collect the operational status of kernel services and initiate diagnosis on demand. Deploying DyCause is fast and lightweight, as it imposes no architectural or functional requirements on the kernel. To reveal more accurate correlations from asymmetric diagnostic information, we design a novel statistical algorithm that efficiently discovers the time-varying causalities between services. This algorithm also helps us build the temporal order of the anomaly propagation. Therefore, by using DyCause, we can obtain more in-depth and interpretable diagnostic clues with limited indicators. We apply and evaluate DyCause on both a simulated test-bed and a real-world cloud system. Experimental results verify that DyCause running in user space outperforms several state-of-the-art algorithms running in the kernel in accuracy. In addition, DyCause shows advantages in algorithmic efficiency and data sensitivity. Simply put, DyCause produces significantly better results than the baselines when analyzing far fewer or sparser metrics. To conclude, DyCause is faster to act, deeper in analysis, and easier to deploy.
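The abstract's notion of time-varying causality between services, listed under the keywords as Granger causal intervals, can be illustrated with a generic sliding-window Granger test. The following is a minimal sketch under stated assumptions and is not DyCause's actual algorithm: the function names, the window and step sizes, and the fixed F-statistic threshold (a stand-in for a proper p-value) are all hypothetical, and the inputs are assumed to be two equally sampled metric series such as per-service API latencies.

```python
import numpy as np


def _lagged(series, lag, n):
    """Columns series[t-1], ..., series[t-lag] for t = lag .. n-1."""
    return np.column_stack([series[lag - k:n - k] for k in range(1, lag + 1)])


def _rss(design, target):
    """Residual sum of squares of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)


def granger_f(x, y, lag=2):
    """F-statistic for the hypothesis 'x Granger-causes y' at the given lag.

    Compares a restricted model (y on its own lags) with a full model
    (y on its own lags plus the lags of x).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    target = y[lag:]
    restricted = np.hstack([np.ones((n - lag, 1)), _lagged(y, lag, n)])
    full = np.hstack([restricted, _lagged(x, lag, n)])
    rss_restricted = _rss(restricted, target)
    rss_full = _rss(full, target)
    dof = len(target) - full.shape[1]
    return ((rss_restricted - rss_full) / lag) / (rss_full / dof)


def causal_windows(x, y, window=60, step=20, lag=2, f_threshold=5.0):
    """Return (start, end, F) for each window where x appears to drive y.

    f_threshold is an illustrative cutoff, not a calibrated significance level.
    """
    hits = []
    for start in range(0, len(x) - window + 1, step):
        f = granger_f(x[start:start + window], y[start:start + window], lag)
        if f > f_threshold:
            hits.append((start, start + window, f))
    return hits
```

Running `causal_windows` on two series whose dependency appears only partway through the trace should flag the later windows, which is the kind of interval-level evidence from which a temporal order of anomaly propagation could be assembled. A production version would replace the fixed threshold with a p-value from the F distribution and correct for multiple testing across windows.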

Original language: English
Title of host publication: ISSTA 2021 - Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis
Editors: Cristian Cadar, Xiangyu Zhang
Publisher: Association for Computing Machinery, Inc
Pages: 646-657
Number of pages: 12
ISBN (Electronic): 9781450384599
State: Published - 11 Jul 2021
Externally published: Yes
Event: 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2021 - Virtual, Online, Denmark
Duration: 11 Jul 2021 to 17 Jul 2021

Publication series

Name: ISSTA 2021 - Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

Conference

Conference: 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2021
Country/Territory: Denmark
City: Virtual, Online
Period: 11/07/21 to 17/07/21

Keywords

  • Dynamic service dependency
  • Granger causal intervals
  • Microservice system
  • Root cause analysis
