Nov 2024
22 Fri 09:00 AM – 03:20 PM IST
23 Sat 09:30 AM – 06:15 PM IST
Have you ever experienced an abrupt service shutdown in production due to the inability to monitor CPU utilization and memory spikes post-deployment? If so, you understand the critical importance of service metrics monitoring.
At PhonePe, we empower our engineers to continuously monitor their systems using OpenTSDB. On top of these metrics, we have built an in-house alerting system, Anomaly Detection, which gives teams real-time alerts for any anomalies. More than 200 clients push more than 400 billion metrics a day, with peaks touching 5 million metrics per second. We retain raw metrics for 30 days and rolled-up metrics for 365 days. The overall cluster footprint is close to 80 bare-metal servers holding terabytes of data.
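As a rough sanity check on these figures, the short Python sketch below converts the daily volume into an average per-second rate and compares it with the quoted peak. The constant and variable names are illustrative only, and because both figures are stated as "more than" or "touching", the results should be read as approximations rather than measured values.

```python
# Back-of-the-envelope check of the ingestion figures quoted in the abstract.
METRICS_PER_DAY = 400e9        # "more than 400 billion metrics a day"
PEAK_PER_SEC = 5e6             # "peaks touching 5 million metrics per second"
SECONDS_PER_DAY = 24 * 60 * 60

# Average sustained rate implied by the daily volume.
average_per_sec = METRICS_PER_DAY / SECONDS_PER_DAY

# How far the quoted peak sits above that average.
peak_to_average = PEAK_PER_SEC / average_per_sec

print(f"average: {average_per_sec / 1e6:.2f} million metrics/sec")
print(f"peak/average ratio: {peak_to_average:.2f}")
```

Under these assumptions the average works out to roughly 4.6 million metrics per second, so the quoted peak is only slightly above the sustained rate.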
In this talk, we will cover:
Systems architecture of our metrics platform along with OpenTSDB.
A deep dive into the system optimisations we have made over the years to scale Kafka and HBase, which act as the backbone of our platform.
Production outages and remediations
Key takeaways
How we scaled OpenTSDB to handle 400 billion metrics a day
Rollup of metrics using Spark
Feedback loop to build an intelligence system
Dos and don'ts for managing large infrastructure
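To make the idea of a metrics rollup concrete, here is a minimal sketch in plain Python (not Spark, and not the pipeline described in the talk). It groups raw data points into fixed-width time buckets and keeps a few summary aggregates per bucket. The function name `rollup`, the one-hour interval, and the choice of sum, count, min, and max are assumptions made for illustration.

```python
from collections import defaultdict

def rollup(points, interval_seconds=3600):
    """Aggregate (unix_timestamp, value) pairs into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        bucket_start = ts - (ts % interval_seconds)
        buckets[bucket_start].append(value)
    # Keep only summary aggregates per bucket instead of every raw point.
    return {
        start: {
            "sum": sum(vals),
            "count": len(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for start, vals in sorted(buckets.items())
    }

# Hypothetical raw points: two fall in the first hour, two in the second.
raw = [(0, 1.0), (1800, 3.0), (3600, 5.0), (7199, 7.0)]
hourly = rollup(raw)
print(hourly)
```

Storing sum and count rather than a precomputed average lets coarser rollups (for example, daily from hourly) be derived later without revisiting the raw points.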
This session is useful for:
Developers
SRE/DevOps
Engineering Managers