Rollup: Managing 300 Billion Daily Metrics at PhonePe

Nov 2024


22 Fri 09:00 AM – 05:10 PM IST


Bangalore International Centre, Bengaluru



Submitted Oct 28, 2024

Submission type: 40 min talk
Track in which your submission fits: Systems engineering

Overview

The Metrics Platform enables our engineers at PhonePe to monitor their services around the clock. It stores and serves the data that powers Grafana dashboards and the anomaly detection alert infrastructure. All metrics are stored in OpenTSDB, a well-established open-source time series database.

Efficiently managing vast amounts of time-series data is crucial for monitoring and data analysis. As data volume grows, storing high-resolution data and querying it becomes increasingly challenging. PhonePe’s Metrics Platform addresses these challenges through a feature known as ‘Rollup’.

Rollup is the process of aggregating time-series data over specified intervals. Think of Rollup as weather summaries. Rather than checking every minute-by-minute temperature reading, you record the daily highs and lows, capturing the overall trends without getting bogged down by too much data.
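To make the idea concrete, here is a minimal sketch of rolling up raw data points into fixed intervals. It is an illustration only, not the platform's implementation: the function name `rollup`, the `(timestamp, value)` point shape, and the choice of min, max, sum, and count as the stored aggregates are all assumptions.

```python
from typing import Dict, Iterable, Tuple

Point = Tuple[int, float]  # (unix timestamp in seconds, value); hypothetical shape


def rollup(points: Iterable[Point], interval_s: int) -> Dict[int, Dict[str, float]]:
    """Aggregate raw points into fixed-size buckets keyed by bucket start time."""
    buckets: Dict[int, Dict[str, float]] = {}
    for ts, value in points:
        # Align the timestamp down to the start of its interval.
        start = ts - ts % interval_s
        agg = buckets.get(start)
        if agg is None:
            buckets[start] = {"min": value, "max": value, "sum": value, "count": 1}
        else:
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
            agg["sum"] += value
            agg["count"] += 1
    return buckets


if __name__ == "__main__":
    raw = [(0, 1.0), (30, 3.0), (3599, 2.0), (3600, 10.0)]
    for start, agg in sorted(rollup(raw, 3600).items()):
        print(start, agg)
```

Keeping sum and count rather than a precomputed average lets averages be derived correctly over any wider window later.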

Challenges

Running the Rollup Spark job alongside live metrics ingestion (~3.8 million metrics per second) was a significant challenge: the job’s extensive data scans adversely affected the HBase cluster’s performance. After experimenting with various patterns involving live table scans, we opted to use Table Snapshots instead. This approach yielded a ~7x performance improvement without affecting ingestion rates.

Furthermore, introducing split and merge query support, which fetches data from both raw and rolled-up datasets, delivered a ~10x performance improvement for historical queries while maintaining backward compatibility.
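One way to picture split and merge is sketched below. It assumes a single cutoff timestamp (`rollup_boundary`) before which rolled-up data exists; the query range is split at that cutoff, each part is fetched from its own store, and the results are merged in time order. The names `split_range`, `split_and_merge`, `fetch_rollup`, and `fetch_raw` are hypothetical, and the actual query path in the platform may differ.

```python
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, float]   # (timestamp, value); hypothetical shape
Range = Tuple[int, int]     # half-open interval [start, end)
Fetch = Callable[[int, int], List[Point]]


def split_range(
    start: int, end: int, rollup_boundary: int
) -> Tuple[Optional[Range], Optional[Range]]:
    """Split [start, end) into a rolled-up part and a raw part at the boundary."""
    if end <= start:
        raise ValueError("end must be greater than start")
    if end <= rollup_boundary:
        return (start, end), None          # entirely historical
    if start >= rollup_boundary:
        return None, (start, end)          # entirely recent
    return (start, rollup_boundary), (rollup_boundary, end)


def split_and_merge(
    start: int, end: int, rollup_boundary: int,
    fetch_rollup: Fetch, fetch_raw: Fetch,
) -> List[Point]:
    """Fetch each sub-range from its store and merge the results by timestamp."""
    rollup_range, raw_range = split_range(start, end, rollup_boundary)
    merged: List[Point] = []
    if rollup_range is not None:
        merged.extend(fetch_rollup(*rollup_range))
    if raw_range is not None:
        merged.extend(fetch_raw(*raw_range))
    return sorted(merged)
```

Because callers still pass a single time range and get back a single series, a design like this can stay backward compatible with existing dashboards and alerts.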

Agenda

Why we introduced Rollup in our time series database (OpenTSDB)
Deep dive into the design and architecture of Rollup at scale (~3.8 million metrics/second)
Tech stack: Spark, OpenTSDB, HBase, and Kafka
Challenges and learnings from implementing Rollup at scale

Key takeaways

How Rollup helped enhance the performance and reliability of the time series database
Query performance improvement: ~10x faster queries compared to those fetching raw data
Daily storage optimisation: ~2.5 TB of raw data rolled up and reduced to ~40 GB
Substantial storage and infra cost savings: roughly 64x less storage compared to raw data
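As a quick sanity check, the figures quoted in this proposal can be related with simple arithmetic. The sketch below assumes binary units (1 TB = 1024 GB), which reproduces the 64x figure; with decimal units the ratio would be 62.5x. It also derives the average ingestion rate implied by 300 billion metrics per day. How that average relates to the quoted ~3.8 million metrics per second is not stated in the proposal.

```python
# Storage reduction: ~2.5 TB of raw data per day rolled up to ~40 GB.
raw_gb = 2.5 * 1024          # assumes 1 TB = 1024 GB
rolled_up_gb = 40
storage_ratio = raw_gb / rolled_up_gb

# Average ingestion rate implied by 300 billion metrics per day.
metrics_per_day = 300_000_000_000
seconds_per_day = 24 * 60 * 60
avg_per_second = metrics_per_day / seconds_per_day

if __name__ == "__main__":
    print(f"storage ratio: {storage_ratio:.1f}x")
    print(f"average rate: {avg_per_second / 1e6:.2f} million metrics/second")
```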

Audience

Site Reliability and DevOps Engineers
Engineering leaders
Cloud architects and engineers
Teams building large-scale observability solutions


Hosted by

Rootconf


Rootconf Mini 2024 (on 22nd & 23rd Nov)
