Rootconf Mini 2024 (on 22nd & 23rd Nov)

Geeking out on systems and security since 2012

Robin Tak

@robintak

Rollup: Managing 300 Billion Daily Metrics at PhonePe

Submitted Oct 28, 2024

Overview

The Metrics Platform enables our engineers at PhonePe to monitor their services around the clock. This platform stores and serves the data that powers Grafana dashboards and the anomaly detection alert infrastructure. All metrics are stored in OpenTSDB, a well-established open-source time-series database.

In monitoring and data analysis, efficiently managing vast amounts of time-series data is crucial. As data volume grows, storing high-resolution data and querying it becomes increasingly challenging. PhonePe’s Metrics Platform addresses these challenges through a feature known as ‘Rollup’.

Rollup is the process of aggregating time-series data over specified intervals. Think of Rollup as weather summaries. Rather than checking every minute-by-minute temperature reading, you record the daily highs and lows, capturing the overall trends without getting bogged down by too much data.
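The aggregation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Spark job discussed in the talk: the function name `rollup`, the one-hour interval, and the choice of aggregates (sum, count, min, max) are all assumptions made for the example.

```python
from collections import defaultdict

def rollup(points, interval_seconds=3600):
    """Aggregate (timestamp, value) pairs into fixed-width time buckets.

    Hypothetical helper for illustration. Returns a dict mapping each
    bucket's start timestamp to its sum, count, min and max.
    """
    buckets = defaultdict(
        lambda: {"sum": 0.0, "count": 0, "min": float("inf"), "max": float("-inf")}
    )
    for ts, value in points:
        # Align the timestamp down to the start of its interval.
        start = ts - (ts % interval_seconds)
        bucket = buckets[start]
        bucket["sum"] += value
        bucket["count"] += 1
        bucket["min"] = min(bucket["min"], value)
        bucket["max"] = max(bucket["max"], value)
    return dict(buckets)

# Four raw readings, 30 minutes apart, collapse into two hourly buckets.
raw = [(0, 21.0), (1800, 24.0), (3600, 19.0), (5400, 23.0)]
hourly = rollup(raw)
```

Keeping both the sum and the count per bucket means an average can still be derived at query time, while min and max preserve the "daily highs and lows" from the weather analogy.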

Challenges

Running the Rollup Spark job alongside live metrics ingestion (~3.8 million metrics per second) posed a significant challenge: the job’s extensive data scans adversely affected the HBase cluster’s performance. After experimenting with various patterns involving live table scans, we opted to read from Table Snapshots instead. This approach yielded a ~7x performance improvement without affecting ingestion rates.

Furthermore, adding split-and-merge query support, which fetches data from both the raw and rolled-up datasets, delivered a ~10x performance improvement for historical queries while maintaining backward compatibility.
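One way to picture a split-and-merge query is the sketch below. It assumes a single cutoff timestamp before which rolled-up data exists. The proposal does not say how the boundary is actually chosen, so `rollup_cutoff`, `split_query`, `merged_query` and the two fetcher callables are hypothetical names used only for illustration.

```python
def split_query(start, end, rollup_cutoff):
    """Split [start, end) into a rolled-up range and a raw range.

    Either part is None when the query does not touch that dataset.
    """
    rolled = (start, min(end, rollup_cutoff)) if start < rollup_cutoff else None
    raw = (max(start, rollup_cutoff), end) if end > rollup_cutoff else None
    return rolled, raw

def merged_query(start, end, rollup_cutoff, fetch_rollup, fetch_raw):
    """Fetch each sub-range from its dataset and merge by timestamp."""
    rolled, raw = split_query(start, end, rollup_cutoff)
    result = []
    if rolled:
        result.extend(fetch_rollup(*rolled))
    if raw:
        result.extend(fetch_raw(*raw))
    return sorted(result)

# Stand-in fetchers: hourly points from the rollup store, per-minute raw points.
def fake_rollup(s, e):
    return [(t, "rollup") for t in range(s, e, 3600)]

def fake_raw(s, e):
    return [(t, "raw") for t in range(s, e, 60)]

points = merged_query(0, 7320, 7200, fake_rollup, fake_raw)
```

Because callers still pass one time range and receive one merged series, a design along these lines would leave existing dashboards and queries unchanged, which is consistent with the backward compatibility mentioned above.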

Agenda

Why we introduced Rollup in our time-series database (OpenTSDB)
Deep dive into the design and architecture of Rollup at scale (~3.8 million metrics/second)
Tech stack: Spark, OpenTSDB, HBase and Kafka
Challenges and learnings from implementing Rollup at scale

Key takeaways

How Rollup helped enhance the performance and reliability of the time-series database
Query performance improvement: ~10x faster queries compared to those fetching raw data
Daily storage optimisation: ~2.5 TB of raw data rolled up and reduced to ~40 GB
Substantial storage and infra cost savings: roughly 64x less storage than raw data
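The storage figures above are consistent with each other, as a quick check shows. The check assumes 1 TB = 1024 GB. With 1 TB = 1000 GB the ratio comes out at 62.5x instead, which still matches "roughly 64x".

```python
# Daily volumes quoted in the takeaways (approximate figures).
raw_gb = 2.5 * 1024   # ~2.5 TB of raw data, assuming 1 TB = 1024 GB
rolled_gb = 40        # ~40 GB after rollup

ratio = raw_gb / rolled_gb
print(f"Storage reduction: ~{ratio:.0f}x")
```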

Audience

Site Reliability and DevOps Engineers
Engineering leaders
Cloud architects and engineers
Teams building large-scale observability solutions

