Thu, 8 Jan 2026, 04:15 PM – 07:15 PM IST
Aakash Singhal
@aakashsinghal
Submitted Nov 26, 2025
Description
This session walks through how we introduced EC2 Spot Instances across thousands of production ECS services in a large, multi-cluster environment. We’ll cover the architecture and the objective criteria used to determine Spot eligibility, along with the automated system we built to manage it, powered by Lambda functions, periodic evaluators, service-level checks, and a central orchestrator that could safely toggle Spot for each workload.
You’ll see how we balanced substantial cost savings with strict reliability requirements in a high-traffic ecosystem. We cover handling Spot interruptions and fluctuating AWS capacity, validating graceful shutdown paths and LB deregistration behavior, and computing a custom “Spot Placement Score,” to show how to adopt Spot at scale using automation, guardrails, and data-driven rollout logic.
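To make the idea of objective eligibility criteria concrete, here is a minimal sketch in Python. It is not the system presented in the talk: the `ServiceProfile` fields, the thresholds, the rule that drain time plus stop timeout must fit in the notice window, and the capacity provider names `on-demand-cp` and `spot-cp` are all assumptions chosen for illustration. The only facts relied on are that EC2 Spot issues a two-minute interruption notice and that an ECS capacity provider strategy is a list of items with `capacityProvider`, `weight`, and `base` fields.

```python
from dataclasses import dataclass

# EC2 Spot issues a two-minute notice before reclaiming an instance.
SPOT_NOTICE_SECONDS = 120


@dataclass
class ServiceProfile:
    """Hypothetical per-service facts a periodic evaluator might collect."""
    name: str
    handles_sigterm: bool        # graceful shutdown path has been validated
    stop_timeout_s: int          # container stop timeout
    deregistration_delay_s: int  # load balancer draining time
    desired_count: int
    placement_score: float       # assumed to be in [0, 1]; real formula not given


def is_spot_eligible(svc, min_tasks=2, min_score=0.7):
    """Return (eligible, reasons); every threshold here is an assumption."""
    reasons = []
    if not svc.handles_sigterm:
        reasons.append("graceful shutdown not validated")
    # Conservative assumption: draining and stopping happen back to back.
    if svc.deregistration_delay_s + svc.stop_timeout_s > SPOT_NOTICE_SECONDS:
        reasons.append("drain + stop exceeds Spot notice window")
    if svc.desired_count < min_tasks:
        reasons.append("too few tasks to absorb an interruption")
    if svc.placement_score < min_score:
        reasons.append("placement score below threshold")
    return (not reasons, reasons)


def capacity_provider_strategy(eligible, spot_weight=3, on_demand_base=1):
    """Build an ECS-style strategy list; provider names are hypothetical."""
    if not eligible:
        return [{"capacityProvider": "on-demand-cp", "weight": 1, "base": 0}]
    return [
        # Keep a fixed floor of tasks on On-Demand capacity.
        {"capacityProvider": "on-demand-cp", "weight": 1, "base": on_demand_base},
        {"capacityProvider": "spot-cp", "weight": spot_weight, "base": 0},
    ]
```

An orchestrator along these lines could pass the resulting list as the `capacityProviderStrategy` argument of the ECS UpdateService call. Returning the rejection reasons alongside the boolean keeps each decision auditable, which matters when no human is in the loop.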
Takeaways
A scalable framework for Spot adoption: You’ll learn the criteria, automation patterns, and scoring mechanisms that let you safely onboard thousands of ECS services onto Spot with no manual effort.
How to maintain reliability at scale: Gain insight into the guardrails, health checks, and architectural patterns required to make Spot work in large, business-critical production environments where even a small increase in failure rate is unacceptable.
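As an illustration of what a failure-rate guardrail could look like, the sketch below compares a service's error rate after enabling Spot against its earlier baseline and signals a rollback when the increase exceeds a tolerance. The function names, the `(errors, requests)` input shape, and both default thresholds are assumptions for this example, not details from the talk.

```python
def failure_rate(errors, requests):
    """Fraction of failed requests; 0.0 when there is no traffic."""
    return errors / requests if requests else 0.0


def should_disable_spot(baseline, current, max_increase=0.001, min_requests=1000):
    """Decide whether to roll a service back to On-Demand.

    baseline and current are (errors, requests) tuples measured before and
    after Spot was enabled. Both thresholds are hypothetical defaults.
    """
    _, current_requests = current
    if current_requests < min_requests:
        # Too little traffic to judge; leave the current setting alone.
        return False
    increase = failure_rate(*current) - failure_rate(*baseline)
    return increase > max_increase
```

Requiring a minimum request count before acting avoids flapping on low-traffic services, where a handful of errors would otherwise look like a large regression.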
Audience
Speaker Bio
Aakash is a Senior Platform Engineer at Deliveroo. As part of the Network, Edge and Compute (NEC) team, he’s responsible for building a scalable, reliable, cost-effective, and engineer-friendly platform.
Link to elevator pitch: https://drive.google.com/file/d/1l4NPdCQwh5TbKtRFQd4dHVl4raOnMIkV/view?usp=sharing