Rows, columns, and consequences

Speak at Rootconf’s Special Edition on Databases

Sarthak Makhija

@sarthak_makhija

Fast on Paper, Slow in Reality: What We Got Wrong About Performance

Submitted Apr 25, 2026

Description

In distributed systems engineering, a design that is “correct on paper” is only the beginning; the real challenge is making it “fast in reality.” This session is a candid post-mortem of the architectural assumptions we made while building a distributed key-value store from scratch in Go, and of why several of those assumptions collapsed under production-grade load. Moving beyond high-level design, we examine the performance bottlenecks hidden inside standard distributed patterns: how a generalized two-phase commit (2PC) became a crippling bottleneck; why a waiting list guarded by Go’s standard mutex became a global point of contention; and how our initially “standard” transactional steps caused redundant network and disk I/O that unexpectedly doubled our latency.

By deconstructing these failures, we provide a practical roadmap for building distributed stateful systems that perform as well in production as they do on paper. We will walk through our remediation journey: bypassing protocol stages for localized transactions, batching at the storage layer, and eliminating redundant network calls to local nodes. Attendees will leave with a clear understanding of how to bridge the gap between theoretical correctness and real-world performance in high-scale distributed databases.
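To make "bypassing protocol stages for localized transactions" concrete, here is a minimal sketch, not the implementation discussed in the talk. It assumes a coordinator that groups a transaction's writes by owning shard and commits in a single round when only one shard is involved, falling back to prepare/commit otherwise. The names Participant, Coordinator, Route, CommitDirect and memShard are hypothetical, and a real fast path would still need conflict checks and durable logging on the shard.

```go
package main

import (
	"errors"
	"fmt"
)

// Participant is a hypothetical shard-side interface (illustrative names only).
type Participant interface {
	Prepare(txID string, writes map[string]string) error
	Commit(txID string) error
	Abort(txID string)
	// CommitDirect validates and applies the writes in a single step.
	CommitDirect(txID string, writes map[string]string) error
}

// Coordinator maps keys to shards and picks a commit protocol per transaction.
type Coordinator struct {
	Shards []Participant
	Route  func(key string) int // assumed key-to-shard mapping
}

// Commit applies the writes and returns the number of protocol rounds used:
// 1 for the shard-local fast path, 2 for the general prepare/commit path.
func (c *Coordinator) Commit(txID string, writes map[string]string) (int, error) {
	// Group the write set by owning shard.
	groups := make(map[int]map[string]string)
	for k, v := range writes {
		s := c.Route(k)
		if groups[s] == nil {
			groups[s] = make(map[string]string)
		}
		groups[s][k] = v
	}
	if len(groups) == 0 {
		return 0, nil
	}
	// Fast path: a single participant can decide and apply in one round.
	if len(groups) == 1 {
		for s, w := range groups {
			return 1, c.Shards[s].CommitDirect(txID, w)
		}
	}
	// General path: prepare on every shard, then commit on every shard.
	prepared := make([]int, 0, len(groups))
	for s, w := range groups {
		if err := c.Shards[s].Prepare(txID, w); err != nil {
			for _, p := range prepared {
				c.Shards[p].Abort(txID)
			}
			return 1, fmt.Errorf("prepare failed on shard %d: %w", s, err)
		}
		prepared = append(prepared, s)
	}
	for _, s := range prepared {
		if err := c.Shards[s].Commit(txID); err != nil {
			return 2, err
		}
	}
	return 2, nil
}

// memShard is a toy in-memory participant used only to exercise the sketch.
type memShard struct {
	data   map[string]string
	staged map[string]map[string]string
}

func newMemShard() *memShard {
	return &memShard{
		data:   make(map[string]string),
		staged: make(map[string]map[string]string),
	}
}

func (m *memShard) Prepare(txID string, w map[string]string) error {
	m.staged[txID] = w
	return nil
}

func (m *memShard) Commit(txID string) error {
	w, ok := m.staged[txID]
	if !ok {
		return errors.New("unknown transaction")
	}
	for k, v := range w {
		m.data[k] = v
	}
	delete(m.staged, txID)
	return nil
}

func (m *memShard) Abort(txID string) { delete(m.staged, txID) }

func (m *memShard) CommitDirect(txID string, w map[string]string) error {
	for k, v := range w {
		m.data[k] = v
	}
	return nil
}
```

In this sketch the returned round count is only a stand-in for the latency saved: a shard-local transaction pays for one exchange with one participant instead of two exchanges with every participant.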

Takeaways

  • Protocol Fast-Paths: Learn how to identify “safe paths” in distributed transactions to bypass the 2PC tax and significantly reduce latency for shard-local operations.
  • Lock Partitioning: Practical strategies for managing high-concurrency bottlenecks in Go by moving from a global lock to partitioned lock groups (using concurrent maps such as those provided by xsync) to isolate contention across different request paths and correlation IDs.
  • Defensive Storage Design: Why storage-layer pagination and I/O batching are critical for preventing out-of-memory (OOM) failures and latency spikes during large-scale range queries and high-throughput operations.
  • Scaling Inter-Node I/O: How moving from a single persistent outbound connector per partition to multiple connectors can dramatically increase replication throughput and resilience.
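As an illustration of the lock-partitioning takeaway, the following is a minimal sketch of a waiting list striped by correlation ID, so that requests with unrelated IDs rarely contend on the same mutex. It deliberately uses only the Go standard library rather than xsync, and it is an assumption about the general technique, not the design presented in the talk. WaitingList, stripeCount, Add and Complete are hypothetical names.

```go
package main

import (
	"hash/fnv"
	"sync"
)

// stripeCount is an assumed value; tune it to the expected concurrency.
const stripeCount = 64

// stripe guards its own partition of the waiting list.
type stripe struct {
	mu      sync.Mutex
	waiters map[string]chan string
}

// WaitingList maps correlation IDs to channels awaiting a response.
// Each ID hashes to one stripe, so a lock is shared only by IDs in that stripe.
type WaitingList struct {
	stripes [stripeCount]stripe
}

// NewWaitingList allocates the per-stripe maps.
func NewWaitingList() *WaitingList {
	wl := &WaitingList{}
	for i := range wl.stripes {
		wl.stripes[i].waiters = make(map[string]chan string)
	}
	return wl
}

// stripeFor picks the stripe that owns the given correlation ID.
func (wl *WaitingList) stripeFor(id string) *stripe {
	h := fnv.New32a()
	h.Write([]byte(id))
	return &wl.stripes[h.Sum32()%stripeCount]
}

// Add registers a waiter and returns the channel its response will arrive on.
func (wl *WaitingList) Add(id string) <-chan string {
	ch := make(chan string, 1) // buffered so Complete never blocks
	s := wl.stripeFor(id)
	s.mu.Lock()
	s.waiters[id] = ch
	s.mu.Unlock()
	return ch
}

// Complete delivers a response and removes the waiter.
// It reports whether a waiter was registered for the ID.
func (wl *WaitingList) Complete(id, response string) bool {
	s := wl.stripeFor(id)
	s.mu.Lock()
	ch, ok := s.waiters[id]
	delete(s.waiters, id)
	s.mu.Unlock()
	if ok {
		ch <- response
	}
	return ok
}
```

A caller would register with Add before sending a request and then receive from the returned channel; the response handler calls Complete with the same correlation ID. Only operations whose IDs hash to the same stripe serialize on one lock.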

Target Audience

This session is designed for Backend Engineers, Systems Designers, and SREs who are interested in database internals and the practical performance trade-offs inherent in building and scaling distributed stateful systems.

Bio

Sarthak Makhija is a Principal Architect at Caizin specializing in storage engines and distributed systems. While at ThoughtWorks, he led the development of a strongly consistent, distributed key-value storage engine in Go from scratch.

He is a contributor to the book Patterns of Distributed Systems and writes about database internals on his blog, tech-lessons.in.

Sarthak also conducts workshops on Rust and on the “Internals of key-value storage engines: LSM-trees and beyond.”


Hosted by

We care about site reliability, cloud costs, security and data privacy