Nov 2024
18 Mon
19 Tue
20 Wed
21 Thu
22 Fri 09:00 AM – 05:10 PM IST
23 Sat
24 Sun
The Metrics Platform enables our engineers at PhonePe to monitor their services around the clock. The platform stores and serves the data that powers Grafana dashboards and the anomaly-detection alerting infrastructure. All metrics are stored in OpenTSDB, a well-established open-source time-series database.
In monitoring and data analysis, managing vast amounts of time-series data efficiently is crucial. As data volume grows, storing high-resolution data and querying it become increasingly challenging. PhonePe's Metrics Platform addresses these challenges through a feature known as 'Rollup'.
Rollup is the process of aggregating time-series data over specified intervals. Think of Rollup as weather summaries. Rather than checking every minute-by-minute temperature reading, you record the daily highs and lows, capturing the overall trends without getting bogged down by too much data.
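To make the idea concrete, here is a minimal sketch of a rollup over fixed intervals. It is illustrative only and not the platform's actual implementation: the function name `rollup`, the one-hour default interval, and the choice of min/max/sum/count as the stored aggregates are all assumptions made for this example.

```python
from collections import defaultdict


def rollup(points, interval_s=3600):
    """Aggregate (timestamp_seconds, value) pairs into fixed-size buckets.

    Hypothetical helper for illustration; the interval and the set of
    aggregates are assumptions, not taken from the talk.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its interval.
        bucket_start = ts - (ts % interval_s)
        buckets[bucket_start].append(value)
    # Keep only a compact summary per bucket instead of every raw point.
    return {
        start: {
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Five raw readings collapse into two hourly summaries.
points = [(0, 21.0), (60, 22.5), (3540, 19.0), (3600, 25.0), (3660, 24.0)]
hourly = rollup(points)
```

Storing the sum and count rather than a precomputed average lets averages be derived later over any larger window without revisiting the raw points.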
Rollup runs as a Spark job. Running it alongside the daily metrics processing (ingesting ~3.8 million metrics per second) posed a significant challenge: the job's extensive data scans adversely affected the HBase cluster's performance. After experimenting with various patterns involving live table scans, we opted to read from table snapshots instead. This approach yielded a ~7x performance improvement without affecting ingestion rates.
Furthermore, introducing split-and-merge query support, which fetches data from both the raw and the rolled-up datasets, delivered a ~10x performance improvement for historical queries while maintaining backward compatibility.
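One way to picture split-and-merge querying is sketched below. This is a simplified illustration under stated assumptions, not the platform's query path: `split_query`, `merge_results`, and the single `rollup_cutoff` timestamp (the point up to which rolled-up data is assumed to exist) are hypothetical names introduced for this example.

```python
def split_query(start, end, rollup_cutoff):
    """Split the range [start, end) into a rolled-up part and a raw part.

    Assumption: data older than rollup_cutoff is available in rolled-up
    form, and newer data only exists as raw points.
    """
    parts = []
    if start < rollup_cutoff:
        parts.append(("rollup", start, min(end, rollup_cutoff)))
    if end > rollup_cutoff:
        parts.append(("raw", max(start, rollup_cutoff), end))
    return parts


def merge_results(results):
    """Combine per-part lists of (timestamp, value) into one ordered series."""
    merged = []
    for series in results:
        merged.extend(series)
    merged.sort(key=lambda point: point[0])
    return merged


# A query spanning the cutoff is served from both datasets.
parts = split_query(start=0, end=100, rollup_cutoff=60)
series = merge_results([[(60, 1.0), (90, 3.0)], [(0, 2.0)]])
```

Because the caller still issues a single query over one time range and receives one ordered series, a split of this kind can stay invisible to existing dashboards.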
Why we introduced Rollup in our time-series database (OpenTSDB)
Deep dive into the design and architecture of Rollup at scale (~3.8 million metrics/second)
Tech stack: Spark, OpenTSDB, HBase, and Kafka
Challenges and learnings from implementing Rollup at scale
How Rollup improved the performance and reliability of the time-series database
Query performance improvement: ~10x faster queries compared to those fetching raw data.
Daily storage optimisation: ~2.5 TB of raw data rolled up into ~40 GB.
Substantial storage and infrastructure cost savings: roughly 64x less storage than raw data.
Site Reliability and DevOps Engineers
Engineering leaders
Cloud architects and engineers
Teams building large-scale observability solutions
Venue host - Rootconf workshops