Root Cause Analysis of Failures in Microservices through Causal Discovery
Azam Ikram
Advised by Prof. Saurabh Bagchi
https://purdue-edu.zoom.us/j/93166268645?pwd=UlpSYWtqbTh3WFdHNEdQL29ELy9ldz09
Meeting ID: 931 6626 8645
Passcode: 800947
Abstract: Most cloud applications are built from a large number of smaller sub-components (called microservices) that interact with each other in a complex graph to provide the overall functionality to the user. While the modularity of the microservice architecture is beneficial for rapid software development, maintaining and debugging such a system quickly when a failure occurs is challenging. We propose a scalable algorithm for rapidly detecting the root cause of failures in complex microservice architectures. The key ideas behind our novel hierarchical and localized learning approach are: (1) treating the failure as an intervention on the root cause in order to detect it quickly, (2) learning only the portion of the causal graph related to the root cause, thus avoiding a large number of costly conditional independence tests, and (3) exploring the graph hierarchically. The proposed technique is highly scalable and produces useful insights about the root cause, whereas traditional techniques become infeasible because of their high computation time. Our solution is application agnostic and relies only on the data collected for diagnosis. For the evaluation, we compare the proposed solution with a modified version of the PC algorithm and with the state of the art in root cause analysis. The results show a significant improvement in top-k recall along with a substantial reduction in execution time.
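To make idea (1) concrete, here is a minimal Python sketch. It is not the speaker's algorithm; it only illustrates the general notion of treating a failure as an intervention and locating it with conditional independence tests. Everything in it is an assumption made for illustration: the toy service chain A -> B -> C, the unrelated metric D, the binary failure indicator F, the helper names `ci_pvalue` and `find_root_cause`, and the Fisher z-test on (partial) correlations as the independence test. A metric is reported as a root-cause candidate if it depends on F and no other shifted metric explains that dependence away.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ci_pvalue(x, y, z=None):
    """Two-sided Fisher z-test p-value for 'x independent of y', optionally given z."""
    r, k = pearson(x, y), 0
    if z is not None:
        # First-order partial correlation of x and y given z.
        rxz, ryz = pearson(x, z), pearson(y, z)
        r = (r - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
        k = 1
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    stat = math.sqrt(len(x) - k - 3) * math.atanh(r)
    return math.erfc(abs(stat) / math.sqrt(2))


def simulate(n, failing, rng):
    """Toy metrics for a chain A -> B -> C plus an unrelated metric D.

    Assumption for illustration: a failure shifts B only (an intervention on B).
    F is the indicator of the failure period.
    """
    cols = {name: [] for name in "ABCDF"}
    for _ in range(n):
        a = rng.gauss(0, 1)
        b = 0.8 * a + rng.gauss(0, 1) + (3.0 if failing else 0.0)
        c = 0.8 * b + rng.gauss(0, 1)
        d = rng.gauss(0, 1)
        for name, value in zip("ABCDF", (a, b, c, d, float(failing))):
            cols[name].append(value)
    return cols


def find_root_cause(data, f_name="F", alpha=0.001):
    """Return metrics that depend on F and are not separated from F by another shifted metric."""
    f = data[f_name]
    metrics = [m for m in data if m != f_name]
    # Step 1: metrics whose distribution changed between normal and failure periods.
    shifted = [m for m in metrics if ci_pvalue(data[m], f) < alpha]
    # Step 2: drop metrics whose change is explained by another shifted metric.
    roots = []
    for m in shifted:
        explained = any(
            ci_pvalue(data[m], f, data[other]) >= alpha
            for other in shifted
            if other != m
        )
        if not explained:
            roots.append(m)
    return roots


rng = random.Random(0)
normal = simulate(1000, False, rng)
failure = simulate(1000, True, rng)
data = {name: normal[name] + failure[name] for name in normal}
print(find_root_cause(data))
```

In this toy setup both B and C shift during the failure, but C's shift should disappear once B is conditioned on, so the sketch is expected to single out B. Note that it tests only the metrics that changed, rather than learning the full causal graph, which is the intuition behind idea (2); the hierarchical exploration in idea (3) is not modeled here.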
Bio: I am a second-year Ph.D. student in ECE at Purdue University, working with Prof. Saurabh Bagchi. I'm broadly interested in large-scale distributed systems, cloud computing, computer networks, and applied machine learning in systems. My current research focuses on providing fault diagnosability for cloud applications through causal inference. Another part of my work concentrates on making the serverless framework more applicable to the end user by reducing the latency and cost of serverless DAGs. Before joining Purdue, I was at LUMS, where I worked with Prof. Zafar Ayyub Qazi to improve the latency of cellular control-plane messages in 4G/LTE.
Mary Ann Satterfield
Sr. Administrative Assistant
Elmore Family School of Electrical and Computer Engineering
Electrical Engineering Building
465 Northwestern Ave., BHEE 326B
West Lafayette, IN 47907
o: 765-494-6389 m: 765-490-6392 f: 765-494-2706