Jaidev Deshpande

@jaidevd

"Boss, GPU top-up karwa do" - Monitoring Training Costs at Scale

Submitted Jun 24, 2026

Every time we ran out of GPU credits, the fix was the same: someone pinged a cloud administrator and said, “Boss, GPU top-up karwa do.” Nobody asked who burned the last batch, on what, or whether the run even finished. That one sentence (which is now a distant memory in our Slack archives) is what a GPU bill sounds like when no one owns the cost.

This talk is about the effects of that sentence. If we had a million dollars in GPU credits, what would have happened to it? In an audit of a real GPU fleet spanning two quarters, $390,000 of that million would have paid for GPUs sitting below 5% utilization, and 293 of 510 machines would never have run a single workload. So we migrated to a more deliberate, fail-fast way of triggering ML workflows. If a run is going to fail, the platform’s autokill stops it before it costs much: dozens of jobs died at setup having burned zero GPU-hours, where the old VMs would have idled on a live GPU for hours until someone was pinged to kill them by hand. This, of course, did not come for free. People who had learned to live and die inside Jupyter notebooks raged against the ruthless machine that terminated badly coded jobs. This talk is also about how the team’s denial gave way to acceptance, and eventually to love.
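To make the audit arithmetic concrete, here is a minimal sketch of how per-machine telemetry might be rolled up into an "idle spend" share and a count of never-used machines. The record fields, machine names, and numbers are all hypothetical and invented for illustration; only the 5% utilization cutoff comes from the text above. It is not the tooling used in the actual audit.

```python
from dataclasses import dataclass

IDLE_THRESHOLD = 0.05  # the 5% utilization cutoff mentioned in the abstract


@dataclass
class MachineUsage:
    """Hypothetical per-machine summary over an audit window."""
    name: str             # made-up machine identifier
    cost_usd: float       # spend attributed to this machine
    mean_gpu_util: float  # average GPU utilization, as a fraction in [0, 1]
    jobs_run: int         # number of workloads that ever ran on it


def audit(fleet):
    """Return the share of spend on near-idle GPUs and the never-used count."""
    total = sum(m.cost_usd for m in fleet)
    idle_cost = sum(m.cost_usd for m in fleet if m.mean_gpu_util < IDLE_THRESHOLD)
    never_used = sum(1 for m in fleet if m.jobs_run == 0)
    return {
        "idle_cost_share": idle_cost / total if total else 0.0,
        "never_used": never_used,
        "machines": len(fleet),
    }


# Toy data, assumed purely for illustration (not figures from the real fleet).
fleet = [
    MachineUsage("vm-a", 400.0, 0.62, 14),
    MachineUsage("vm-b", 350.0, 0.02, 1),
    MachineUsage("vm-c", 250.0, 0.00, 0),
]
report = audit(fleet)
print(report)
```

The point of the sketch is that both headline numbers fall out of very plain telemetry: a cost column, an average utilization column, and a job count per machine.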

How did we get into this mess in the first place? It’s because of the classic mistake that many AI teams make: believing that good software development practices are luxuries they cannot afford. The proof is in 278 pull requests across 21 repos: roughly one substantive code review in the lot, dozens of PRs merged with none, an automated reviewer flagging an “8x cost increase, no guardrails” while a human waved it through with “lgtm,” and a single throwaway PR held open for a month to fire 61 GPU jobs.

Every percent of that imaginary million dollars maps to a specific, teachable engineering practice. My primary thesis in this talk is that MLOps isn’t a product you buy; it’s DevOps discipline applied to ML: reproducibility, cost ownership, and code review as a quality filter.


Key Takeaways

  • The most appealing aspect of this talk is the many stories you’ll hear about how a range of people - from novices to experts - react to administrative change.
  • Something as simple as machine telemetry can reveal a lot about shoddy craftsmanship.
  • Navigating the long tail of MLOps failures: fixing the problem at the peak just reveals a new peak, shorter than the previous one but harder to solve.
  • Everything, unsurprisingly, boils down to best practices, and they’re not a luxury. Readable code = modular code = testable code = deployable code.

About Me

I’m Jaidev Deshpande - a programmer and blogger who specializes in machine learning. I live in New Delhi. I currently head the MLOps effort at Aftershoot, a computer vision startup that helps photographers streamline their workflows with AI tools. I have more than a decade of experience in full-stack development centered around ML/AI. You’re likely to run into me at various tech events.

Hosted by

Jumpstart better data engineering and AI futures