Rows, columns, and consequences

Speak at Rootconf’s Special Edition on Databases

Vaibhaw Pandey

Architecting a Scalable and Resilient PostgreSQL Platform on Bare Metal Kubernetes

Submitted Apr 30, 2026

Description

Running a handful of PostgreSQL instances on Kubernetes is a solved problem — the CloudNativePG operator handles it elegantly. Running ten thousand of them on bare metal — with sub-5-minute provisioning, automated failover, and 99.99% availability — is where every comfortable assumption breaks down. Operator reconciliation loops behave differently at fleet scale, storage latency becomes non-uniform in ways that surprise you, and a failover storm across a few hundred instances can cascade in ways no chaos engineering drill prepared you for. This talk is about what we learned building and operating an enterprise Database-as-a-Service platform that deploys PostgreSQL on bare metal Kubernetes clusters using Metal Stack for infrastructure provisioning, CloudNativePG for database orchestration, and Nutanix CSI for persistent storage.
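To put the 99.99% target in concrete terms, here is a rough back-of-the-envelope sketch of the downtime budget it implies. The function name is illustrative, and the fleet-level figure assumes every one of the ten thousand instances sits exactly at the target and fails independently, which is a simplification rather than a claim about the platform described here.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Allowed downtime, in minutes, over a period at a given availability percentage."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_days * 24 * 60


# Per-instance budget at the 99.99% target mentioned in the abstract.
yearly = downtime_budget_minutes(99.99, 365)
monthly = downtime_budget_minutes(99.99, 30)

# Assumption: 10,000 independent instances, each exactly at 99.99%.
# Expected number of instances unavailable at any given moment.
expected_down = 10_000 * (1.0 - 99.99 / 100.0)

print(f"{yearly:.2f} minutes per year")
print(f"{monthly:.2f} minutes per 30 days")
print(f"{expected_down:.2f} instances down on average")
```

Under these assumptions the budget is roughly 52.56 minutes per year, or about 4.32 minutes per 30 days, for each instance. That is why a single slow failover can consume a month of budget, and why at fleet scale some instance is likely to be unavailable at almost any moment.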

We’ll walk through the architecture decisions that worked, the ones that didn’t, and the operational patterns we developed to keep thousands of databases healthy without drowning in toil. Topics include: how we tamed storage I/O variance across a multi-tenant bare metal fleet, why the operator’s default failover behavior fell apart at ~2,000 instances and what we tuned to fix it, how we achieved point-in-time recovery and zero-downtime patching across the fleet without a dedicated DBA per cluster, and the monitoring/alerting philosophy that lets a small team operate at this scale. If you’re considering moving databases onto Kubernetes — or already have and are hitting walls — this is the talk that tells you what’s on the other side.

Takeaways:

  1. A practical architecture blueprint for running PostgreSQL at scale on bare metal Kubernetes — including storage configuration, failover orchestration, and backup/restore patterns that survive real production failures.
  2. Specific failure modes and operational gotchas that only emerge at scale (1,000+ instances) — the kind that don’t appear in staging or proofs-of-concept — and the design patterns that mitigate them.

Who is this for?

Platform engineers building internal Database-as-a-Service offerings, database administrators evaluating Kubernetes for stateful workloads, and infrastructure architects making the bare metal vs. cloud decision for databases. Also useful for anyone running stateful workloads on Kubernetes who wants to understand where the scaling cliffs are and how to design around them.

Bio

Krunal Jhaveri is a Senior Engineering Manager at Nutanix, Inc., California, focusing on cloud infrastructure and data services. He specializes in designing and implementing large-scale stateful workload solutions, helping enterprises modernize their database operations and leverage automation for improved availability and agility.

Vaibhaw Pandey is a Staff Engineer at Nutanix, Bengaluru. He focuses on the intersection of database operations and platform engineering, helping large enterprises design, automate, and scale mission-critical stateful workloads on Kubernetes.


Hosted by

We care about site reliability, cloud costs, security and data privacy