Rows, columns, and consequences

Speak at Rootconf’s Special Edition on Databases

Akkireddy Gunta

Zero Downtime, Zero Excuses: How YugabyteDB Keeps Running When Everything Goes Wrong

Submitted Apr 30, 2026

When a node in your production cluster dies mid-write, how long until the database is serving again, and how much data is lost? With YugabyteDB, the answer is roughly three seconds, and no data is lost. But that answer comes with a lot of engineering underneath it. Tablets. Raft consensus. Quorum writes. Leader elections. Fault domains. None of these is magic; they are deliberate design decisions with real trade-offs, and understanding them is the difference between an on-call engineer who panics and one who knows exactly what is happening and why it will resolve itself.

The session walks through the mechanisms behind each recovery step (tablet leader failure detection, quorum-based write safety, Raft election, and cluster rebalance) and the configuration decisions associated with each: replication factor, election timeout, and follower lag threshold. These are things you can go back and reason about in your own cluster.
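The quorum arithmetic underneath those mechanisms is simple enough to sketch. The snippet below is an illustration, not YugabyteDB code: with a Raft group of RF replicas, a write commits once a majority acknowledges it, which is what bounds how many simultaneous replica failures a tablet can tolerate.

```python
# Illustrative sketch of majority-quorum arithmetic (not YugabyteDB source).
# A write commits when a majority of the RF replicas acknowledge it, so a
# tablet keeps accepting writes as long as a majority of replicas survive.

def quorum_size(replication_factor: int) -> int:
    """Minimum acknowledgements needed to commit a write."""
    return replication_factor // 2 + 1

def failures_tolerated(replication_factor: int) -> int:
    """Replica failures a tablet can survive while still taking writes."""
    return (replication_factor - 1) // 2

for rf in (1, 3, 5):
    print(f"RF={rf}: quorum={quorum_size(rf)}, "
          f"tolerates {failures_tolerated(rf)} failure(s)")
# RF=1 tolerates nothing; RF=3 tolerates one failure; RF=5 tolerates two.
```

This is also why raising RF trades write latency for resilience: every committed write waits on one more acknowledgement at RF=5 than at RF=3.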

Fault domains get the same treatment. RF=3 sounds safe. It isn't if your three replicas share a zone. We cover what node-level, zone-level, and region-level fault tolerance each actually guarantees, and what each costs in write latency and infrastructure, so you can size your replication factor against a real failure scenario rather than a default.
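The zone argument can be checked mechanically. Here is a small illustrative sketch (hypothetical helper, not YugabyteDB tooling) that, given where a tablet's replicas are placed, tests whether losing any single fault domain still leaves a write quorum:

```python
# Illustrative sketch: does a replica placement survive the loss of any
# one fault domain? With RF=3 a write quorum is 2, so all three replicas
# in one zone means one zone outage removes the quorum entirely.

from collections import Counter

def survives_domain_loss(placements: list[str], rf: int = 3) -> bool:
    """True if losing any single fault domain leaves a write quorum."""
    quorum = rf // 2 + 1
    replicas_per_domain = Counter(placements)
    # For each domain, assume it fails and check the survivors.
    return all(rf - lost >= quorum for lost in replicas_per_domain.values())

print(survives_domain_loss(["zone-a", "zone-a", "zone-a"]))  # False
print(survives_domain_loss(["zone-a", "zone-b", "zone-c"]))  # True
```

Same RF, very different guarantee: the single-zone placement fails the check because one zone outage takes all three replicas, while the three-zone placement always leaves two survivors.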

The second half of the talk tackles something even scarier than a crash: a planned upgrade. Rolling upgrades in YugabyteDB are designed to be zero-downtime, but "zero-downtime" is a claim that deserves scrutiny. We'll walk through exactly how it works: leaders are migrated off a node before it goes offline, writes continue on the remaining nodes, and the upgraded node rejoins and triggers a rebalance. We'll also cover the part that most upgrade documentation skips: mixed-version clusters. While nodes run two different versions simultaneously, YugabyteDB deliberately delays new wire protocol features and data formats until every node is upgraded. We'll show what that looks like, and what your rollback path is if something goes wrong mid-upgrade.
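The upgrade sequence described above can be sketched as a loop. This is an illustration with hypothetical step names, not the real YugabyteDB orchestration: the key properties are that only one node is offline at a time, and that new on-the-wire behavior is finalized only after every node runs the new version.

```python
# Illustrative sketch of a rolling-upgrade sequence (hypothetical steps,
# not the actual YugabyteDB control-plane API). One node at a time:
# drain leaders, upgrade, rejoin. New wire-protocol features and data
# formats are enabled only once all nodes are on the new version.

def rolling_upgrade(nodes: list[str], new_version: str) -> list[str]:
    plan = []
    for node in nodes:
        plan.append(f"migrate leaders off {node}")          # writes shift to peers
        plan.append(f"stop {node}, install {new_version}")  # one node offline
        plan.append(f"rejoin {node}, rebalance tablets")    # catch up, spread load
    # Mixed-version safety: only finalize once every node is upgraded.
    plan.append("finalize: enable new wire protocol and data formats")
    return plan

for step in rolling_upgrade(["node-1", "node-2", "node-3"], "v2.20"):
    print(step)
```

Until the finalize step runs, the cluster keeps speaking the old formats, which is what makes rollback possible mid-upgrade: a half-upgraded node can be reverted without having written anything the old version cannot read.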

Takeaways
Fault domain planning is where reliability actually gets configured. RF=3 across one zone is not the same as RF=3 across three zones. We’ll give you the framework to match your replication topology to your actual failure scenario.
Zero-downtime upgrades are real, but they require understanding. You’ll understand the exact upgrade sequence, how mixed-version clusters stay safe, and what the rollback path looks like, so planned maintenance stops being the scary part of your job.

Target Audience
Database engineers and SREs who run distributed databases in production and want to understand the failure mechanics, not just the recovery playbook. Platform engineers who make architectural decisions about replication factors and fault-domain configurations. Anyone who has ever been on-call for a database outage and wants to understand what was actually happening under the hood, and how to make sure the next one is shorter.

Speaker Bio
Vipul Bansal, Senior software engineer at YugabyteDB, working on the Control Plane. I work on deploying and managing YugabyteDB at scale with zero downtime.

