Rootconf Mini 2024

Geeking out on systems and security since 2012

Kanika Khetawat

@kanikakhetawat

Santanu Sinha

@santanusinha

Spyglass: Graph based automated RCA tool

Submitted Oct 25, 2024

Abstract

In a microservices architecture, detecting issues quickly becomes harder as scale grows. At PhonePe, we handle about a million requests per second at the edge. This translates to tens to hundreds of millions of service calls across thousands of service containers. Traditional detection mechanisms such as distributed tracing typically either generate too much data to manage and analyze easily, or are sampled so heavily that issues cannot be found fast enough. In this talk, we will discuss Spyglass, a solution designed to enable fast drill-down to the root cause of failures across a large, service-oriented distributed system operating at high scale.

Spyglass is a graph-based solution that captures the interactions between services, as well as the internal calls (such as database and queue operations) made within a service. These interactions are captured as metrics, which can be used to understand the overall flow of requests and, during an outage, to quickly narrow down from the affected service to the sub-system or component experiencing issues. To assess the health of each graph node, Spyglass leverages the monitoring metrics pushed by each service, along with an in-house Anomaly Detection System.
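
To make the drill-down idea concrete, here is a minimal sketch of how a root cause might be located on such a graph. It is an illustration under stated assumptions, not Spyglass's actual implementation: the function name find_root_causes, the service and component names, and the rule "an unhealthy node with no unhealthy dependencies is a suspect" are all hypothetical. The set of unhealthy nodes is assumed to come from an anomaly detector evaluated over each node's metrics.

```python
from collections import deque

def find_root_causes(calls, unhealthy, entry):
    """Walk the call graph from `entry`, following only unhealthy dependencies.

    calls:     dict mapping a node to the nodes it calls (hypothetical shape)
    unhealthy: set of nodes flagged by an anomaly detector (assumed input)
    Returns unhealthy nodes that have no unhealthy dependency of their own,
    in breadth-first order. These are the likeliest origins of the failure.
    """
    suspects = []
    seen = {entry}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        bad_deps = [dep for dep in calls.get(node, []) if dep in unhealthy]
        if node in unhealthy and not bad_deps:
            # Nothing further downstream is failing, so the fault likely starts here.
            suspects.append(node)
        for dep in bad_deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return suspects

# Hypothetical graph: services plus their internal components (db, queue, cache).
calls = {
    "edge-gateway": ["payments", "offers"],
    "payments": ["payments:db", "ledger"],
    "ledger": ["ledger:queue"],
    "offers": ["offers:cache"],
}
# Nodes the (assumed) anomaly detector currently marks as unhealthy.
unhealthy = {"edge-gateway", "payments", "ledger", "ledger:queue"}

print(find_root_causes(calls, unhealthy, "edge-gateway"))
```

In this example the errors seen at the edge propagate through payments and ledger, and the walk stops at the ledger's queue, the deepest unhealthy node. A production system would also need to handle noisy health signals and nodes that fail independently of their dependencies.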

Agenda

  • Why we developed Spyglass
  • Deep dive into design and architecture of Spyglass
  • Service dependencies and automated RCA support
  • Real time alerts using Spyglass

Key Takeaways

  • Fundamentals of failure prediction and detection at scale
  • How Spyglass improves observability and enables identification of failures
  • Design of the metric ingestion, management, anomaly detection, and Spyglass systems at PhonePe

Audience

  • Developers
  • Site Reliability Engineers
  • Engineering Managers
