Dhruv Nigam

@dhruvn

Your voice agent is (probably) doomed, OR how not to fall victim to outdated voice agent playbooks

Submitted Jun 17, 2026

Googly Bhai was busy this IPL season. He live-streamed to an **audience of 300,000 every day**, with 1.5 million minutes of watch time and 1.1 million concurrent viewers at peak. Hundreds of other streamers called him onto their streams to discuss live scores, gossip, and make predictions (see him live jamming with another streamer). He switched to Aussie and British accents mid-conversation at the audience’s request and effortlessly switched between 80+ languages, including Bengali, Telugu, and Punjabi.

Googly Bhai, despite the name, is not a small-time Mumbai goon. He’s the world’s first real-time AI sports streaming companion - a sarcastic, street-smart half-monkey who loves Virat (and lets it show). He reacts to live match events, runs trivia mid-stream, reads audience chats in real time, and pulls up live stats - all at roughly 2-second latency.

When we started building the AI voice companion for live sports entertainment for our 5M daily users on Dream11 in 2025, conventional wisdom told us to treat it as an extension of LLM chatbots. Unfortunately, we followed this advice, and it cost us 5 months: we built and discarded 8 prototypes and rewrote our entire codebase three times before arriving at something that worked. The insights we earned are surprising, non-trivial, and until now, insider knowledge. Through this talk, I want to share them with everyone.

The chatbot extension school of thought assumes that building a voice agent means adding plumbing over an LLM-powered chatbot. Just add an STT and a TTS, sandwich the LLM in the middle, and you’ve built a voice agent, right? Right? WRONG! Nothing could be further from the truth. Voice deserves its own native components:

  • Full duplex speech-to-speech models - not an LLM sandwiched between an STT and a TTS
  • A network protocol built for voice - not TCP, which was designed to bring your mail
  • A way to inject functionality without blocking or overpowering the conversation - not tools and MCPs, which work alright for chatbots but fail spectacularly for voice agents.
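To make the transport point concrete, here is a minimal sketch of head-of-line blocking. It is a toy simulation, not a measurement from our system: the frame interval, network delay, retransmit cost, and jitter buffer size are all assumed values, and the function names are hypothetical. It compares a reliable in-order stream (the delivery model TCP provides) with datagram delivery against a playout deadline (roughly how real-time media over UDP behaves).

```python
FRAME_MS = 20        # assumed audio frame interval
ONE_WAY_MS = 40      # assumed one-way network delay
RETRANSMIT_MS = 200  # assumed extra time to detect a loss and redeliver


def arrival_times(num_frames, lost):
    """Time each frame reaches the receiver; lost frames arrive late via retransmit."""
    times = []
    for i in range(num_frames):
        sent = i * FRAME_MS
        delay = ONE_WAY_MS + (RETRANSMIT_MS if i in lost else 0)
        times.append(sent + delay)
    return times


def in_order_delivery(arrivals):
    """Reliable ordered stream: a frame is released only after every earlier frame."""
    released, latest = [], 0
    for t in arrivals:
        latest = max(latest, t)
        released.append(latest)
    return released


def deadline_delivery(arrivals, jitter_buffer_ms=60):
    """Datagram style: play each frame if it beats its deadline, else skip it (None)."""
    played = []
    for i, t in enumerate(arrivals):
        deadline = i * FRAME_MS + ONE_WAY_MS + jitter_buffer_ms
        played.append(t if t <= deadline else None)
    return played


def demo():
    arrivals = arrival_times(5, lost={1})
    return in_order_delivery(arrivals), deadline_delivery(arrivals)
```

In this toy run, one lost frame delays every later frame on the ordered stream, while the deadline-based receiver skips a single 20 ms frame and stays on schedule. For conversational audio, a brief glitch is usually the better trade.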

In addition, I’ll share how to think about scaling concurrency in a world where everyone is GPU-bound.

All of these lessons will be backed by real production numbers, which we will share!

What I hope the audience will take away

A battle-tested playbook they can take back to their desk and use on Monday. A playbook that doesn’t exist in the open right now. Four specific lessons:

  1. Cascaded STT→LLM→TTS pipelines are broken for conversations. Latency adds up and conversational context decays at every boundary. Speech-to-speech is the future.
  2. WebSockets and REST APIs over TCP will kill the real-time voice experience, because TCP prioritizes reliable, in-order delivery over latency: one lost packet stalls everything queued behind it. WebRTC over UDP was built for multimedia communication and prioritizes latency over correctness.
  3. Tool calls create dead air and add cognitive load for models. People can wait 10 seconds for an answer but not for a response. With voice agents, latency and perceived latency are different things, and it is important to get the distinction right. The async subagent pattern keeps the conversation alive while heavy computation runs in the background.
  4. The size of your model provider matters. Bigger providers can offer pay-as-you-go pricing at higher scale because they can aggregate demand. Small labs hit GPU provisioning walls fast. And if you’re thinking about self-hosting, I will break your heart with facts in this talk.
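As a rough illustration of lesson 3, here is a minimal sketch of the async subagent idea using Python's asyncio. It is not our production code: `slow_stats_lookup`, `converse`, the `stats:` prefix convention, and all timings are hypothetical stand-ins. The point is only the shape: acknowledge immediately, run the slow work in a background task, and fold the result into the conversation once it is ready.

```python
import asyncio


async def slow_stats_lookup(query: str) -> str:
    """Hypothetical subagent: stands in for a slow tool or database call."""
    await asyncio.sleep(0.3)  # simulated heavy work
    return f"stats for {query!r}"


async def converse(user_turns, speak):
    """Respond to every turn at once; surface subagent results when they finish."""
    pending = set()
    for turn in user_turns:
        if turn.startswith("stats:"):
            # Start the lookup in the background instead of awaiting it here.
            query = turn.removeprefix("stats:").strip()
            pending.add(asyncio.create_task(slow_stats_lookup(query)))
            speak("Good question, let me pull that up.")
        else:
            speak(f"(reply to) {turn}")
        # Surface any results that finished while the conversation continued.
        done = {task for task in pending if task.done()}
        for task in done:
            speak(f"By the way: {task.result()}")
        pending -= done
        await asyncio.sleep(0.2)  # simulated gap until the next user turn
    # Wait for anything still running when the conversation ends.
    for task in pending:
        speak(f"By the way: {await task}")


def run_demo():
    transcript = []
    turns = ["stats: Virat strike rate", "who is batting?", "nice shot!"]
    asyncio.run(converse(turns, transcript.append))
    return transcript
```

Calling `run_demo()` returns the spoken lines in order. The acknowledgment comes first, the agent keeps replying to later turns, and the stats line appears only after the background task completes. A blocking tool call would instead leave silence for the full lookup time.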

Cui bono? Who will benefit?

Engineers building voice agents or real-time AI applications where conversation quality is paramount. Product and engineering leaders deciding architecture for voice-first products. Anyone who’s hit the wall where their voice agent demo works but their production system doesn’t.

Dhruv Nigam - Speaker Bio

Staff ML Engineer, Dream11


Dhruv Nigam is a Staff ML Engineer at Dream11, India’s largest fantasy sports platform. At Dream11, he built LUMOS, a 200M-parameter foundation model trained on 1.7 trillion tokens that replaced 50+ task-specific models and lifted DAU by 3% overnight. Most recently, his team launched the world’s first live sports AI influencer, a multi-agent voice system that has reached over a million users. Before Dream11, he worked as a quantitative analyst at an investment bank, building trading algorithms on billions in capital.

He has presented at KDD Barcelona, Swiggy, and MumbaiPy, among other venues, and writes about ML engineering at ML Trenches. His interests lie in LLM post-training, inference optimization, and agentic systems.

References

Slides (WIP):
https://docs.google.com/presentation/d/1XFxKHhwdDqJx7EIQW53EIVxL84DEn1AOggUDI3yJpUY/edit?usp=sharing

Official Dream11 blog about the AI companion:
https://medium.com/dreamlockerroom/how-we-built-the-worlds-first-real-time-ai-sports-streaming-companion-a42273de1fca

Personal blog on lessons learned building voice agents:
https://mltrenches.substack.com/p/what-i-learned-building-voice-agents
