Rows, columns, and consequences

Speak at Rootconf’s Special Edition on Databases

Varun Mishra

Breaking the Linear Bottleneck: Re-Architecting HBase Backups

Submitted Apr 29, 2026

Abstract
For critical distributed systems, data durability is the ultimate operational baseline. However, at extreme scale, with millions of OLTP operations per second across petabyte-sized datasets, traditional backup strategies often become the very bottleneck they were designed to prevent.

In this talk, we dive into how we re-engineered the Apache HBase Backup and Restore framework to support a cloud-native, parallel execution model. The existing open-source implementation relies on a sequential, linear model that uses exclusive table-level locking, forcing all backup operations into a strict serial queue. For large clusters, this results in an operationally unsustainable Recovery Point Objective (RPO), a window of potential data loss that no longer meets modern business continuity requirements.
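To make the bottleneck concrete, here is a rough sketch of how the time to finish one full backup cycle (which bounds the achievable RPO) differs between a strict serial queue and a pool of parallel workers. The table names, durations, worker count, and the greedy longest-first scheduling are all hypothetical assumptions for illustration; none of them come from the clusters or the framework discussed in the talk.

```python
import heapq


def serial_cycle_hours(durations):
    """Strict serial queue: one backup at a time, so the cycle is the sum."""
    return sum(durations.values())


def parallel_cycle_hours(durations, workers):
    """Greedy longest-first assignment of backups to `workers` parallel slots.

    Returns the finish time of the busiest slot, i.e. the cycle length.
    """
    slots = [0.0] * workers          # accumulated hours per slot
    heapq.heapify(slots)
    for hours in sorted(durations.values(), reverse=True):
        earliest = heapq.heappop(slots)      # least-loaded slot so far
        heapq.heappush(slots, earliest + hours)
    return max(slots)


# Hypothetical per-table backup durations, in hours.
durations = {
    "orders": 4.0,
    "payments": 3.0,
    "catalog": 2.5,
    "sessions": 2.0,
    "audit": 1.5,
}

serial = serial_cycle_hours(durations)
parallel = parallel_cycle_hours(durations, workers=4)
print(f"serial cycle: {serial} h, parallel cycle: {parallel} h")
```

In the serial model every table waits behind all the others, so the cycle grows with the number of tables; in the parallel model it is bounded by the busiest worker.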

We will explore the architectural “Rename Trap” encountered when moving backup chains to cloud object stores like GCS/S3, and how we decoupled HBase from legacy HDFS filesystem assumptions. You will learn the technical intricacies of designing a parallel-safe locking mechanism using composite lock values, and an algorithm for WAL (Write-Ahead Log) retention that prevents unbounded disk growth.
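As a rough illustration of these two ideas, the sketch below models (1) a lock whose value is a composite of a backup ID and the set of tables that backup covers, so that backups over disjoint table sets can run concurrently, and (2) a retention rule under which a WAL file becomes deletable only once every backup chain has checkpointed past it. The names `BackupLockTable` and `deletable_wals`, and both rules themselves, are assumptions made for illustration; they are not the HBase API and not necessarily the design presented in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class BackupLockTable:
    """Hypothetical in-memory model of composite lock values."""

    # backup_id -> frozenset of table names held by that backup
    held: dict = field(default_factory=dict)

    def try_acquire(self, backup_id, tables):
        wanted = frozenset(tables)
        for owner, owned in self.held.items():
            if owner != backup_id and owned & wanted:
                return False  # overlaps with a table another backup holds
        self.held[backup_id] = wanted
        return True

    def release(self, backup_id):
        self.held.pop(backup_id, None)


def deletable_wals(wal_timestamps, chain_checkpoints):
    """Return WAL names older than the slowest chain's checkpoint.

    With no chain information we conservatively retain everything
    (an assumption of this sketch).
    """
    if not chain_checkpoints:
        return []
    low_watermark = min(chain_checkpoints.values())
    return [name for name, ts in wal_timestamps.items() if ts < low_watermark]


locks = BackupLockTable()
print(locks.try_acquire("b1", ["orders", "payments"]))  # disjoint from nothing yet
print(locks.try_acquire("b2", ["catalog"]))             # disjoint from b1
print(locks.try_acquire("b3", ["payments", "audit"]))   # overlaps b1 on "payments"
```

The retention rule keys off the minimum checkpoint across chains, so a single lagging chain keeps its WALs alive without forcing the others to wait.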

Key Takeaways

  1. The Linear Bottleneck: Understand how sequential backup models fail in high-throughput environments and lead to “false notions” of RPO.

  2. Solving the Rename Trap: A technical deep dive into enabling cloud-optimized committers in the HBase execution engine.

  3. Hard Metrics: A look at how these architectural shifts reduced our mandated RPO from >12 hours down to a targeted 1–3 hours.

  4. Operational Reliability: Practical techniques for debugging distributed deadlocks and ensuring consistency across parallel chains during rolling restarts.

Target Audience
This session is designed for Backend Engineers, Systems Designers, and SREs who are interested in database internals and in practical approaches to building and scaling distributed stateful systems.

About Me
Varun Mishra is a senior software engineer (SDE-III) at Flipkart, working on centrally managed platforms, where his team focuses on high-scale distributed systems and their reliability. Varun has more than 7 years of experience in software development and more than 5 years working on databases.

