Rows, columns, and consequences

Speak at Rootconf’s Special Edition on Databases

Varun Mishra

From Timeout to Sub-Second: Solving Scale-Dependent Deadlocks in Distributed Systems

Submitted Apr 29, 2026

Abstract
In highly coordinated distributed systems like Apache HBase, operations often rely on global barriers, synchronized procedures that require every node to reach a consensus point before moving forward. At extreme scales, these barriers become highly sensitive to thread contention and coordination overhead. This talk details a real-world production incident at Flipkart where a critical disaster recovery pipeline hit a hard “60-second wall”, consistently failing due to a hidden architectural flaw.

We will walk through the journey of diagnosing a scale-dependent deadlock that only manifested in 50+ node production clusters while remaining invisible in smaller staging environments. Attendees will learn how a seemingly harmless, redundant synchronous RPC call from a worker node back to the central coordinator created a circular dependency in the Master’s RPC handlers, causing the entire cluster-wide log roll procedure to time out.
The session covers the debugging methodology used to prove the deadlock, including the use of synchronized, multi-instance thread dumps across hundreds of nodes. Finally, we discuss the architectural shift required to solve it: decoupling local worker tasks from synchronous callbacks during time-sensitive global barriers.
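The circular wait described above can be reproduced in miniature. The following is a minimal sketch under stated assumptions, not the HBase code path: `BarrierDeadlockSketch`, `runBarrier`, and all parameters are hypothetical names, the coordinator's RPC handlers are modelled as a fixed-size thread pool, and the worker's redundant check is modelled as a task submitted back to that pool and awaited synchronously. It does not model how node count or load exhausts the handlers in a real cluster.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Toy model (hypothetical names): a barrier coordinator with a bounded handler pool. */
class BarrierDeadlockSketch {

    /**
     * Runs one global barrier and returns true if every worker reached it
     * before the timeout, false if the barrier timed out.
     */
    static boolean runBarrier(int handlerThreads, int workers,
                              boolean syncCallback, long timeoutMs) throws Exception {
        ExecutorService handlers = Executors.newFixedThreadPool(handlerThreads);
        ExecutorService workerPool = Executors.newFixedThreadPool(workers);
        CountDownLatch barrier = new CountDownLatch(workers);
        try {
            // The barrier procedure itself occupies one coordinator handler thread.
            Future<Boolean> procedure = handlers.submit(() -> {
                for (int i = 0; i < workers; i++) {
                    workerPool.submit(() -> {
                        if (syncCallback) {
                            // Redundant synchronous check: the worker blocks until
                            // a coordinator handler is free to answer it.
                            handlers.submit(() -> "ok").get();
                        }
                        barrier.countDown(); // worker reports it reached the barrier
                        return null;
                    });
                }
                // The handler parks here, waiting for the same workers that may be
                // waiting for a handler.
                return barrier.await(timeoutMs, TimeUnit.MILLISECONDS);
            });
            return procedure.get();
        } finally {
            handlers.shutdownNow();   // drop queued callbacks
            workerPool.shutdownNow(); // interrupt workers still blocked on a callback
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("1 handler, sync callback:  " + runBarrier(1, 4, true, 500));
        System.out.println("1 handler, no callback:    " + runBarrier(1, 4, false, 500));
        System.out.println("2 handlers, sync callback: " + runBarrier(2, 4, true, 500));
    }
}
```

In this model the barrier times out only when the sole handler is parked and the workers still insist on a synchronous callback. Dropping the callback, or leaving a handler free, lets the barrier complete well within the timeout.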

Key Takeaways

  1. The Circular Dependency: Understand how a synchronous RPC call from a worker back to a coordinator whose handler threads are blocked waiting on that worker leads to a distributed deadlock.
  2. Practical Approach: How to use synchronized, multi-instance thread dumps (taken at fixed intervals) to definitively prove that a thread is blocked rather than just slow.
  3. Hard Metrics: See how removing a single redundant check took the rolllog procedure from failing at its mandated 60,000 ms timeout to completing in just a few hundred milliseconds.
  4. Architectural Rule of Thumb: Never allow workers to make synchronous callbacks to a coordinator that is currently parked waiting for those same workers.
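The fixed-interval thread dump idea in takeaway 2 can be sketched for a single JVM. This is an in-process analogue under stated assumptions, not the multi-node tooling the talk covers: `StuckThreadSketch` and `findStuckThreads` are hypothetical names, and the dumps come from the standard `ThreadMXBean` API instead of an external tool. A thread is flagged when it is not RUNNABLE and its stack trace is identical in every snapshot. A thread that is merely slow would be expected to show a changing stack.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

/** Hypothetical helper: compares thread dumps taken at a fixed interval. */
class StuckThreadSketch {

    /** Names of threads that are non-RUNNABLE with an identical stack in every snapshot. */
    static List<String> findStuckThreads(int snapshots, long intervalMs)
            throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Long, String> candidates = new HashMap<>(); // thread id -> stack signature
        Map<Long, String> names = new HashMap<>();      // thread id -> thread name
        for (int i = 0; i < snapshots; i++) {
            if (i > 0) {
                Thread.sleep(intervalMs); // fixed interval between dumps
            }
            Map<Long, String> current = new HashMap<>();
            for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
                if (info.getThreadState() != Thread.State.RUNNABLE) {
                    current.put(info.getThreadId(), Arrays.toString(info.getStackTrace()));
                    names.put(info.getThreadId(), info.getThreadName());
                }
            }
            if (i == 0) {
                candidates.putAll(current);
            } else {
                // Keep only threads still parked at exactly the same stack.
                candidates.entrySet()
                        .removeIf(e -> !e.getValue().equals(current.get(e.getKey())));
            }
        }
        List<String> stuck = new ArrayList<>();
        for (Long id : candidates.keySet()) {
            stuck.add(names.get(id));
        }
        return stuck;
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch never = new CountDownLatch(1);
        Thread parked = new Thread(() -> {
            try {
                never.await(); // parks forever, like a handler waiting on a barrier
            } catch (InterruptedException ignored) {
                // exit quietly
            }
        }, "parked-handler");
        parked.setDaemon(true);
        parked.start();
        Thread.sleep(100);
        System.out.println(findStuckThreads(3, 200));
    }
}
```

Idle JVM housekeeping threads will also appear in the result, so the output is a list of candidates to inspect, not a verdict. Across many nodes the same comparison would need the dumps to be triggered at the same instants on each instance.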

Target Audience
This session is designed for Backend Engineers, Systems Designers, and SREs who are interested in database internals and in practical approaches to building and scaling distributed stateful systems.

About Me
I am Varun Mishra, a senior software engineer (SDE-III) at Flipkart, where I work on centrally managed platforms. My team works on high-scale distributed systems and their reliability. I have more than 7 years of experience in software development and more than 5 years working on databases.

