Cost Optimisation in ECS: Integrating Spot Instances at Scale

This submission has been added to the schedule

Submitted Nov 26, 2025

Session type: Talk (30 mins)

Description
This session walks through how we introduced EC2 Spot Instances across thousands of production ECS services in a large, multi-cluster environment. We’ll cover the architecture and the objective criteria used to determine Spot eligibility, along with the automated system we built to manage it, powered by Lambda functions, periodic evaluators, service-level checks, and a central orchestrator that could safely toggle Spot for each workload.
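As a rough illustration of what a service-level eligibility check feeding such an orchestrator might look like, here is a minimal sketch. It is not the system described in this talk: the `ServiceChecks` fields, the `spot_blockers` and `desired_spot_state` functions, and the "at least two tasks" rule are all hypothetical names and thresholds chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceChecks:
    """Hypothetical per-service check results; field names are illustrative."""
    name: str
    handles_sigterm: bool       # graceful shutdown path has been validated
    behind_load_balancer: bool  # service receives traffic through a load balancer
    deregistration_ok: bool     # load balancer deregistration has been validated
    min_healthy_tasks: int      # tasks that stay up while one is being replaced


def spot_blockers(checks: ServiceChecks) -> list[str]:
    """Return the reasons a service is not yet Spot-eligible (empty if eligible)."""
    blockers = []
    if not checks.handles_sigterm:
        blockers.append("graceful shutdown not validated")
    if checks.behind_load_balancer and not checks.deregistration_ok:
        blockers.append("load balancer deregistration not validated")
    # Assumed threshold: a single-task service has no headroom for an interruption.
    if checks.min_healthy_tasks < 2:
        blockers.append("fewer than 2 tasks")
    return blockers


def desired_spot_state(checks: ServiceChecks) -> bool:
    """Orchestrator decision: enable Spot only when no blockers remain."""
    return not spot_blockers(checks)


if __name__ == "__main__":
    api = ServiceChecks("api", True, True, True, 3)
    print(api.name, desired_spot_state(api), spot_blockers(api))
```

Returning the list of blockers rather than a bare boolean lets a periodic evaluator report why a service was held back, which is useful when the decision is made without a human in the loop.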

You’ll see how we balanced substantial cost savings with strict reliability requirements in a high-traffic ecosystem. We cover handling Spot interruptions and fluctuating AWS capacity, validating graceful shutdown paths and load balancer (LB) deregistration behavior, and computing a custom “Spot Placement Score.” Together, these show how to adopt Spot at scale using automation, guardrails, and data-driven rollout logic.
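To make the shutdown and deregistration concern concrete, the sketch below checks whether a service's drain time fits inside the two-minute notice that EC2 gives before reclaiming a Spot instance. It assumes load balancer deregistration and container stop happen one after the other; the function names and the 15-second safety margin are hypothetical and not taken from the talk.

```python
SPOT_NOTICE_SECONDS = 120  # EC2 issues the Spot interruption notice two minutes ahead


def shutdown_budget(deregistration_delay_s: int, stop_timeout_s: int,
                    safety_margin_s: int = 15) -> int:
    """Seconds left in the notice window after draining and stopping.

    Assumes the deregistration delay elapses first and the stop timeout
    follows it; the safety margin is an arbitrary illustrative value.
    """
    used = deregistration_delay_s + stop_timeout_s + safety_margin_s
    return SPOT_NOTICE_SECONDS - used


def fits_interruption_window(deregistration_delay_s: int, stop_timeout_s: int,
                             safety_margin_s: int = 15) -> bool:
    """True when the service can drain and stop before the instance is reclaimed."""
    return shutdown_budget(deregistration_delay_s, stop_timeout_s, safety_margin_s) >= 0


if __name__ == "__main__":
    print(fits_interruption_window(30, 30))   # short drain, short stop timeout
    print(fits_interruption_window(300, 30))  # long drain exceeds the notice window
```

A check of this shape would flag services whose deregistration delay alone is longer than the notice window, since those cannot finish draining before the instance goes away.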

Takeaways

A scalable framework for Spot adoption: You’ll learn the criteria, automation patterns, and scoring mechanisms that make it possible to onboard thousands of ECS services onto Spot safely, with no manual effort.
How to maintain reliability at scale: Gain insight into the guardrails, health checks, and architectural patterns required to make Spot work in large, business-critical production environments where even a small increase in failure rate is unacceptable.

Audience

Cloud / DevOps engineers managing large EC2 fleets who want to adopt Spot confidently at scale.
SRE and platform engineering teams responsible for high-availability compute platforms.
Engineering managers and architects evaluating cost-optimization strategies for large distributed systems.
Organizations spending heavily on AWS compute and looking for scalable, low-risk approaches to reduce costs across a broad service ecosystem.

Speaker Bio
Aakash is a Senior Platform Engineer at Deliveroo. As part of the Network, Edge and Compute (NEC) team, he’s responsible for building a scalable, reliable, cost-effective, and engineer-friendly platform.

Link to elevator pitch: https://drive.google.com/file/d/1l4NPdCQwh5TbKtRFQd4dHVl4raOnMIkV/view?usp=sharing

Platform Engineering meet-up - Jan 8
