Call for submissions: Platform Engineering Meet-ups


Share your journey of building platforms that power engineering teams

Dev Kulkarni

@dev_kulkarni Co-author

Sakib Malik

@sakibmalik Co-author

Worker Controller: A Multi-tenant Consumer Proxy for Consumption at Scale

Submitted Oct 1, 2025

Session Description

At Zomato, the journey from a customer placing an order to its fulfillment, and beyond, depends on a complex web of systems running seamlessly in the background. To power this at massive scale, we rely on an event-driven architecture built on Kafka and SQS. With thousands of topics and queues in production, and new ones added almost daily as teams roll out features or system improvements, managing this ecosystem is no small task. Every new topic or queue needs corresponding compute resources and workers to consume it. Managing these workers consistently, reliably, and safely has always been a challenge. In this talk, we introduce Worker Controller, our central, multi-tenant, standardized consumer and remote processor, designed to simplify worker creation, improve reliability, support canary deployments, and accelerate feature delivery across the organization.

Worker Controller abstracts away the complexities of consuming from Kafka and SQS by shifting the consumer architecture from a traditional pull-based model to a centralized push-based proxy. Today, it manages over 900 Kafka topics, over 100 SQS queues, and more than 1,000 consumers, handling peaks of ~5 million requests per minute. By default, this proxy layer takes ownership of partition-level offset management, retries, dead-letter queues, batching, and observability. Because consumption and processing are separated, with the controller consuming centrally and dispatching work to workers, poison payloads become much easier to handle safely. Its RPC-driven design makes creating a new worker as simple as writing an RPC method, drastically reducing operational overhead for SREs. More importantly, the push-based model unlocks canary deployments for workers, allowing developers to test new code in production with controlled traffic, catch regressions early, and deploy with confidence. As a central component, Worker Controller also simplifies governance, making it easier to plan for peak days and to ensure that all teams follow best practices consistently.
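To make the push-based model and canary routing described above concrete, here is a minimal sketch. It is not Worker Controller's actual implementation: the names `Controller`, `Route`, `register`, `dispatch`, and `canary_fraction` are all hypothetical, and a plain Python callable stands in for a worker's RPC method.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# A plain callable stands in for a worker's RPC method (assumption).
Handler = Callable[[bytes], None]


@dataclass
class Route:
    stable: Handler
    canary: Optional[Handler] = None
    canary_fraction: float = 0.0  # share of messages pushed to the canary


class Controller:
    """Hypothetical controller: it owns consumption and pushes work out."""

    def __init__(self, rng: Optional[random.Random] = None) -> None:
        self._routes: Dict[str, Route] = {}
        self._rng = rng or random.Random()

    def register(self, topic: str, stable: Handler,
                 canary: Optional[Handler] = None,
                 canary_fraction: float = 0.0) -> None:
        if not 0.0 <= canary_fraction <= 1.0:
            raise ValueError("canary_fraction must be between 0 and 1")
        self._routes[topic] = Route(stable, canary, canary_fraction)

    def dispatch(self, topic: str, payload: bytes) -> str:
        """Push one consumed message to a worker; return which one got it."""
        route = self._routes[topic]
        use_canary = (route.canary is not None
                      and self._rng.random() < route.canary_fraction)
        handler = route.canary if use_canary else route.stable
        handler(payload)
        return "canary" if use_canary else "stable"
```

In this sketch, registering a topic with `canary_fraction=0.05` would push roughly 5% of its messages to the canary handler. Because the controller decides where each message goes, the traffic split can change without touching the workers themselves.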

Key Takeaways

We’ll explore how Worker Controller addresses three critical challenges: resilience, by enforcing retry budgets and DLQs to prevent cascading failures and to handle poison payloads gracefully; control, by allowing dynamic throttling, pause/resume, and gradual rate-limiting of consumption to protect upstream systems; and fairness, by balancing resource allocation across queues so that no “noisy neighbour” can dominate. For Kafka use cases where strict ordering is not required, Worker Controller also supports parallel consumption from a single partition, trading ordering for higher throughput.
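The resilience point above can be illustrated with a small sketch of a retry budget backed by a dead-letter queue. This is an illustration under stated assumptions, not the talk's actual code: `RetryingDispatcher`, `max_attempts`, and the in-memory `dead_letters` list are hypothetical stand-ins, and a real system would likely add backoff between attempts and publish to a real DLQ.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RetryingDispatcher:
    handler: Callable[[bytes], None]   # stands in for the worker's RPC method
    max_attempts: int = 3              # hypothetical per-message retry budget
    dead_letters: List[Tuple[bytes, str]] = field(default_factory=list)

    def deliver(self, payload: bytes) -> bool:
        """Try the handler up to max_attempts times, then dead-letter."""
        last_error = ""
        for _ in range(self.max_attempts):
            try:
                self.handler(payload)
                return True
            except Exception as exc:  # any failure consumes one attempt
                last_error = repr(exc)
        # Budget exhausted: park the payload so it cannot block the queue.
        self.dead_letters.append((payload, last_error))
        return False
```

With a bounded budget, a poison payload is attempted a fixed number of times and then set aside, so the messages behind it keep flowing instead of being stuck behind endless retries.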

By the end of this talk, you’ll have learned best practices and patterns for building reliable and manageable consumers in an event-driven architecture. We’ll explore how to balance developer velocity with operational safety, how to rethink worker lifecycle management, and how centralizing these concerns can unlock both innovation and stability at scale.

Target Audience

This session will be beneficial for a wide range of audiences — from beginners who are just starting to understand event-driven systems, to experienced engineers and architects looking for ways to scale and manage complex consumer ecosystems.

Bio

We have two speakers for this session — their bios are shared below.

Sakib Malik is a Senior Software Engineer (SDE III) at Zomato, part of the Site Reliability Engineering team building platform services for large-scale, distributed systems. He has worked on high-performance libraries and tools, including contributions to the Golang runtime, profiler, and Redis clients. Sakib also designed resilient infrastructure components such as the Worker Controller, a Kafka consumer proxy, and gomemlimit, a memory limiter for Go services that significantly reduced operational costs. He is passionate about performance optimization, reliability, and building developer tooling that enables teams to move faster with confidence.

Dev Kulkarni is a Software Engineer (SDE I) at Zomato, part of the Site Reliability Engineering team. He has contributed to the development of Worker Controller, Zomato’s centralized Kafka and SQS consumer platform. Dev is interested in building and learning about highly scalable, resilient systems that improve reliability and empower teams to deliver at scale.


Hosted by

We care about site reliability, cloud costs, security and data privacy