Rows, columns, and consequences

Speak at Rootconf’s Special Edition on Databases

Vaibhaw Pandey

Beyond Polling: Building an Event-Driven State Engine for Multi-Cluster Database Control Planes

Submitted Apr 30, 2026

Description

You run a database control plane that manages PostgreSQL clusters across dozens of remote Kubernetes environments. You need your internal state to reflect reality — which pods are running, which clusters are healthy, which replicas just failed over. The naive approach is obvious: poll every cluster’s API server every 30 seconds and diff the result against your metadata store. We prototyped that. It worked fine at 5 clusters. But load testing at 20 clusters with hundreds of database pods showed thousands of redundant API calls per minute, wasted bandwidth on unchanged state, and unacceptable pressure on customer API servers during pod churn events. The scaling ceiling was clear before we ever put it in front of customers. So we designed an event-driven state refresh engine using Kubernetes Informers, work queues, and a reference-counted cluster discovery mechanism — the same primitives that Kubernetes controllers use internally, but applied to the problem of keeping an external control plane synchronized with multiple remote clusters.
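As a rough sketch of the reference-counted discovery idea mentioned above: a registry counts how many workload entities reference each remote cluster, starts a watcher on the first reference, and stops it on the last. The names (`Registry`, `Acquire`, `Release`) are illustrative placeholders, not the actual NDB implementation, and a plain stop function stands in for tearing down a real Informer.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry tracks how many workload entities reference each remote
// cluster, starting a watcher on the first reference and stopping it
// when the count drops back to zero.
type Registry struct {
	mu       sync.Mutex
	refs     map[string]int
	watchers map[string]func() // stop functions for active watchers
}

func NewRegistry() *Registry {
	return &Registry{refs: map[string]int{}, watchers: map[string]func(){}}
}

// Acquire is called when a workload entity on a cluster appears.
// start launches a watcher and returns its stop function.
func (r *Registry) Acquire(cluster string, start func() (stop func())) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refs[cluster]++
	if r.refs[cluster] == 1 {
		r.watchers[cluster] = start() // first reference: begin watching
	}
}

// Release is called when a workload entity is removed.
func (r *Registry) Release(cluster string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refs[cluster]--
	if r.refs[cluster] == 0 {
		r.watchers[cluster]() // last reference: stop the watcher
		delete(r.watchers, cluster)
		delete(r.refs, cluster)
	}
}

// Watching reports whether a watcher is currently active for cluster.
func (r *Registry) Watching(cluster string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.refs[cluster] > 0
}

func main() {
	reg := NewRegistry()
	started, stopped := 0, 0
	start := func() func() { started++; return func() { stopped++ } }

	reg.Acquire("us-east", start)
	reg.Acquire("us-east", start) // second workload: no new watcher
	reg.Release("us-east")
	fmt.Println(started, stopped, reg.Watching("us-east")) // 1 0 true
	reg.Release("us-east")
	fmt.Println(started, stopped, reg.Watching("us-east")) // 1 1 false
}
```

The point of the refcount is that watcher lifecycle falls out of the workload entities the control plane already stores, so no separate cluster-inventory poller is needed.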

This talk walks through the engineering decisions behind that design. I’ll cover the three hardest problems we hit: (1) Cluster discovery — how do you know which clusters to watch, when to start watching a new one, and when to stop, without polling yet another service? We evaluated three approaches and landed on watching our own workload entities as the trigger. (2) Reconnection semantics — what happens when a network partition drops your watch connection and the API server’s event history has moved past your last resource version? The Informer handles the 410 Gone relist automatically, but that only rebuilds the local cache. Your external metadata store drifted independently during the outage, and the relist gives you current state without a diff of what you missed — so you need a full reconciliation pass that’s both correct and cheap enough to run on every reconnect. (3) Running a singleton subsystem inside a replicated service — the state engine must run as exactly one instance for correctness, but it’s embedded in a service that needs multiple replicas for availability. I’ll explain the leader election approach via Kubernetes Leases and why “just extract it into a separate service” isn’t always the right first move.

Takeaways

  1. A decision framework for choosing between polling, event-driven watches, and hybrid approaches when synchronizing external state with Kubernetes — with concrete criteria (cluster count, event frequency, acceptable staleness) that determine which pattern fits.

  2. The three non-obvious failure modes of multi-cluster Informer architectures — watch history expiry (410 Gone), credential rotation under active watches, and metadata-store drift during reconnection — and the recovery patterns that handle each without data loss.
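The decision framework in takeaway 1 could be skeletonized as a function over the three criteria. The thresholds below are invented placeholders purely to show the shape of the framework, not the speaker's actual numbers.

```go
package main

import "fmt"

// SyncStrategy sketches the decision criteria named in the talk:
// cluster count, event frequency, and acceptable staleness. The
// cutoff values are illustrative assumptions only.
func SyncStrategy(clusters int, eventsPerMin float64, stalenessSec int) string {
	switch {
	case clusters <= 5 && stalenessSec >= 30:
		return "poll" // few clusters, relaxed staleness: polling is simplest
	case eventsPerMin < 1:
		return "hybrid" // mostly idle clusters: slow poll plus on-demand refresh
	default:
		return "watch" // many clusters or tight staleness: event-driven watches
	}
}

func main() {
	fmt.Println(SyncStrategy(5, 10, 60))  // poll
	fmt.Println(SyncStrategy(50, 200, 5)) // watch
}
```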

Who is this for?

Platform engineers building control planes that manage workloads across multiple Kubernetes clusters. Infrastructure engineers designing state synchronization between Kubernetes and external systems (CMDBs, internal platforms, multi-cluster orchestrators). Anyone who has outgrown polling the Kubernetes API and needs a scalable event-driven alternative — or anyone who is about to hit that wall and wants to skip the painful intermediate steps.

Bio

Marko Nikolic is a Lead Engineer at Nutanix working on Nutanix Database Service (NDB), focusing on the intersection of cloud-native orchestration and high-performance database systems. He specializes in building resilient control planes and scaling stateful infrastructure on Kubernetes.
Vaibhaw Pandey is a Senior Engineer at Nutanix focused on database automation and lifecycle management. With extensive experience in distributed systems, he currently works on scalable discovery and state synchronization engines for global-scale database deployments.

