Call for submissions: Platform Engineering Meet-ups

Share your journey of building platforms that power engineering teams

Srinivas Anant

@jsrinivas_anant

Observability in Kubernetes: The good, the bad, and the ugly

Submitted Jan 2, 2026

Overview

As our Kubernetes footprint grew, so did the challenge of understanding what was happening inside our cluster. We found ourselves drowning in logs from multiple sources, missing critical alerts, and struggling to connect the dots between metrics, logs, and events when things went wrong.

This talk shares our journey of building a single pane of glass for observability. I’ll walk you through what worked, what didn’t, and the hard lessons we learned along the way.

We’ll cover:

  • How we consolidated logs, metrics, and events from multiple sources
  • Building real-time alerting using Clutch, Temporal, and VictoriaMetrics
  • Using NATS and TimescaleDB for incident response and on-call alerting

Key takeaways

  • Real-world patterns for consolidating observability data from multiple sources
  • Common pitfalls to avoid when building a single pane of glass for alerting

Audience

This talk is for Platform Engineers and Infrastructure Engineers who are:

  • Setting up observability for a new Kubernetes cluster
  • Struggling with fragmented monitoring across multiple sources
  • Looking for practical ways to reduce alert fatigue and improve incident response

About Me

I work as a Senior Member of Technical Staff @ Nutanix Technologies India Pvt Ltd. I have expertise in managing and maintaining distributed systems, including Kubernetes, VictoriaMetrics, and HashiStack, and in stitching different tools together into robust infrastructure. I enjoy tinkering with distributed systems. Outside of work, I am an avid reader and enjoy anything related to science fiction.

Hosted by

We care about site reliability, cloud costs, security and data privacy