Jun 12 (Fri) and Jun 13 (Sat), 2026, 09:00 AM – 06:00 PM IST
Varun Mishra
@varunmishra
Submitted Apr 29, 2026
Abstract
In tightly coordinated distributed systems like Apache HBase, operations often rely on global barriers: synchronized procedures that require every node to reach a consensus point before moving forward. At extreme scale, these barriers become highly sensitive to thread contention and coordination overhead. This talk details a real-world production incident at Flipkart in which a critical disaster recovery pipeline hit a hard “60-second wall” and failed consistently because of a hidden architectural flaw.
We will walk through the journey of diagnosing a scale-dependent deadlock that only manifested in 50+ node production clusters while remaining invisible in smaller staging environments. Attendees will learn how a seemingly harmless, redundant synchronous RPC call from a worker node back to the central coordinator created a circular dependency in the Master’s RPC handlers, causing the entire cluster-wide log roll procedure to time out.
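The circular wait described above can be illustrated with a toy model. This is a minimal sketch, not HBase code: it assumes the coordinator serves RPCs from a fixed-size handler pool, that the barrier procedure holds one handler while it waits, and that workers acknowledge the barrier through a side channel that needs no handler. The names `run_barrier`, `barrier_rpc`, and `callback_rpc` are hypothetical, and the model isolates the handler starvation only; it does not reproduce the scale dependence seen in production.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_barrier(num_handlers: int, num_workers: int, timeout_s: float) -> bool:
    """Return True if every worker acknowledges the barrier before the timeout."""
    # Assumption: the coordinator serves all RPCs from this fixed-size pool.
    handlers = ThreadPoolExecutor(max_workers=num_handlers)
    acks = threading.Semaphore(0)  # side channel for barrier acknowledgements

    def callback_rpc() -> str:
        # The redundant call from a worker back to the coordinator.
        return "ok"

    def worker() -> None:
        # Synchronous callback: blocks until a coordinator handler is free.
        handlers.submit(callback_rpc).result()
        acks.release()  # only now does the worker report barrier arrival

    def barrier_rpc():
        # Runs on a coordinator handler and holds it while waiting for acks.
        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in threads:
            t.start()
        deadline = time.monotonic() + timeout_s
        reached = True
        for _ in range(num_workers):
            remaining = deadline - time.monotonic()
            if remaining <= 0 or not acks.acquire(timeout=remaining):
                reached = False  # the barrier hit its timeout "wall"
                break
        return reached, threads

    reached, threads = handlers.submit(barrier_rpc).result()
    for t in threads:
        t.join()  # once the barrier gives up, its handler frees and callbacks drain
    handlers.shutdown()
    return reached
```

In this model, `run_barrier(1, 4, 0.2)` returns `False` because the only handler is held by the barrier while every worker waits for a handler, and `run_barrier(2, 4, 2.0)` returns `True` because a spare handler can serve the callbacks.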
The session covers the debugging methodology used to prove the deadlock, including the use of synchronized, multi-instance thread dumps across hundreds of nodes. Finally, we discuss the architectural shift required to solve it: decoupling local worker tasks from synchronous callbacks during time-sensitive global barriers.
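The decoupling idea can be sketched in a toy model as well. This is an illustrative assumption about the shape of the fix, not the actual change made in production: the worker reports barrier arrival first and sends its callback without waiting for the reply, so the barrier no longer depends on a free coordinator handler. The names `run_decoupled_barrier`, `barrier_rpc`, and `callback_rpc` are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait


def run_decoupled_barrier(num_handlers: int, num_workers: int, timeout_s: float):
    """Return (barrier_reached, callbacks_completed) for the decoupled design."""
    # Assumption: the coordinator serves all RPCs from this fixed-size pool.
    handlers = ThreadPoolExecutor(max_workers=num_handlers)
    acks = threading.Semaphore(0)  # side channel for barrier acknowledgements
    callbacks = []
    callbacks_lock = threading.Lock()

    def callback_rpc() -> str:
        return "ok"

    def worker() -> None:
        acks.release()  # report barrier arrival first
        future = handlers.submit(callback_rpc)  # sent, but not awaited
        with callbacks_lock:
            callbacks.append(future)

    def barrier_rpc() -> bool:
        # Still holds a handler, but no worker waits on a handler any more.
        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in threads:
            t.start()
        deadline = time.monotonic() + timeout_s
        reached = True
        for _ in range(num_workers):
            remaining = deadline - time.monotonic()
            if remaining <= 0 or not acks.acquire(timeout=remaining):
                reached = False
                break
        for t in threads:
            t.join()  # workers never block, so this returns promptly
        return reached

    reached = handlers.submit(barrier_rpc).result()
    wait(callbacks)  # deferred callbacks drain after the barrier releases its handler
    completed = sum(1 for f in callbacks if f.done())
    handlers.shutdown()
    return reached, completed
```

Even with a single handler, `run_decoupled_barrier(1, 4, 2.0)` returns `(True, 4)`: the barrier completes, and the four callbacks run afterwards on the freed handler.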
Key Takeaways
Target Audience
This session is designed for Backend Engineers, Systems Designers, and SREs who are interested in database internals and in practical approaches to building and scaling distributed stateful systems.
Slide Deck
https://docs.google.com/presentation/d/1ylQRYiXmmRqha3vJ0aEtzxNiG5mKG-Td8KQiOHQNsPM/edit?usp=sharing
About Me
I am Varun Mishra, a Senior Software Engineer (SDE-III) at Flipkart, where I work on centrally managed platforms. My team works on high-scale distributed systems and their reliability. I have more than 7 years of experience in software development and more than 5 years of experience working on databases.