How we brought down costs by 60% without any code change.

Jun 2019

21 Fri 08:45 AM – 05:40 PM IST

22 Sat 09:00 AM – 05:30 PM IST

NIMHANS Convention Centre, Bangalore

Submitted Jan 15, 2019

Section: Crisp talk of 20 mins duration

Technical level: Intermediate

Think about systems that perform better at a lower cost than usual.

Think about deploying a server infrastructure that is highly scalable, self-managed, easy to manage and customise, and able to serve a huge volume of traffic.

But at the same time it should be very cost-effective, resilient to crashes and glitches, cause no downtime, and require no application code change.

Sounds like an interesting problem to solve, right?

The answer is yes. We have a solution running in production that meets all of the above expectations simply by using some ready-to-use technologies, and we brought down costs by 60% without any code change or development effort.

As our player base grew and our infrastructure grew with it, we wanted to design an infrastructure that is highly scalable and easy to manage, but at the same time cost-optimal. I am here to walk through the journey by which we achieved this goal and to share our experience and learnings.

Outline

Our Infrastructure : What the infrastructure looks like and how many components it has. Some smart Day-1 infra decisions that helped us achieve our goal. Some advantages, some challenges.
Know thy constraints : Understand the systems, tech stack, data pipeline, logical isolations, applications and their nature in the production environment, and what can be tuned for the better.
Availability vs Reliability : Understand how reliable and fault tolerant your systems are, and evaluate whether they qualify to run on an automated scaling platform.
Resiliency of Services : Dive deep into the system level, where the system can cater to the application and together they interact with the infrastructure to maintain resiliency of services. This brings a system a few steps closer to dealing efficiently with glitches and connection errors, and protects the user experience.

What are all the pieces -

Right Tools/Technologies in Arsenal : Which tools and technologies we cherry-picked to align with our goal, and how.
Compute/EC2 : What a unit compute system/instance should look like, how to select one considering cost and performance, and what to keep within the system image.
Service Deployments : How we managed application builds, build deployment using bootstrap within the instance, checkpoints before adding an instance into production, and graceful decommissioning along with flushing.
Autoscale - Out and In (both) : Understanding traffic patterns and scheduling cluster expansion and shrinking, plus reactive autoscaling based on cluster performance. Hacks to run at the least possible cost without risking anything.
Time based, Performance based - How to leverage “And”, not “Or”
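
One possible reading of combining time-based and performance-based scaling is sketched below in Python. This is an illustrative assumption, not the setup described in the talk: the schedule supplies a minimum cluster size for known traffic windows, the reactive rule proposes a size from current CPU load, and the cluster takes whichever is larger, capped at a safe maximum. All names (`ScheduleWindow`, `desired_capacity`), the 60% CPU target and the instance limits are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduleWindow:
    """A known traffic window with a minimum cluster size (hypothetical)."""
    start_hour: int      # inclusive, 0-23
    end_hour: int        # exclusive, 1-24
    min_instances: int


def scheduled_floor(hour, windows, default_min):
    """Time-based part: minimum instances demanded by the schedule."""
    for window in windows:
        if window.start_hour <= hour < window.end_hour:
            return window.min_instances
    return default_min


def reactive_target(current_instances, avg_cpu_percent, target_cpu_percent):
    """Performance-based part: size the cluster so that the average CPU
    would land near the target if load stayed the same."""
    if current_instances <= 0:
        return 1
    needed = current_instances * avg_cpu_percent / target_cpu_percent
    return max(1, math.ceil(needed))


def desired_capacity(hour, current_instances, avg_cpu_percent, windows,
                     default_min=2, max_instances=20,
                     target_cpu_percent=60.0):
    """Apply both rules together: never drop below the scheduled floor,
    grow further if load demands it, never exceed the safe maximum."""
    floor = scheduled_floor(hour, windows, default_min)
    reactive = reactive_target(current_instances, avg_cpu_percent,
                               target_cpu_percent)
    return min(max_instances, max(floor, reactive))


if __name__ == "__main__":
    # Assumed evening peak between 18:00 and 23:00 needing at least 8 instances.
    evening_peak = [ScheduleWindow(start_hour=18, end_hour=23, min_instances=8)]
    print(desired_capacity(19, 4, 30.0, evening_peak))   # schedule dominates
    print(desired_capacity(3, 4, 90.0, evening_peak))    # load dominates
```

In this sketch the schedule protects against predictable peaks that arrive faster than reactive scaling can respond, while the reactive rule covers unexpected load outside the scheduled windows.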

Other cost components to consider while planning

CDN, Object Storage, Data transfer cost, Cross Availability Zone cost etc.
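
A rough way to keep these components visible while planning is a per-component monthly breakdown. The Python sketch below is only an illustration: the component names, usage figures and unit rates are placeholders, not real AWS prices and not numbers from this talk.

```python
def monthly_cost(usage, rates):
    """Return (per-component breakdown, total) for one month.

    usage: units consumed per component, e.g. instance hours or GB.
    rates: price per unit for the same component keys.
    """
    missing = set(usage) - set(rates)
    if missing:
        raise KeyError(f"no rate defined for: {sorted(missing)}")
    breakdown = {name: usage[name] * rates[name] for name in usage}
    return breakdown, sum(breakdown.values())


def savings_percent(before, after):
    """Percentage reduction from a 'before' bill to an 'after' bill."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100 * (before - after) / before


if __name__ == "__main__":
    # Placeholder rates and usage, chosen only to show the shape of the data.
    rates = {
        "compute_hours": 0.10,
        "cdn_gb": 0.08,
        "object_storage_gb": 0.02,
        "data_transfer_out_gb": 0.09,
        "cross_az_gb": 0.01,
    }
    usage = {
        "compute_hours": 7200,
        "cdn_gb": 500,
        "object_storage_gb": 1000,
        "data_transfer_out_gb": 300,
        "cross_az_gb": 2000,
    }
    breakdown, total = monthly_cost(usage, rates)
    for name, cost in sorted(breakdown.items(), key=lambda item: -item[1]):
        print(f"{name:24s} {cost:10.2f}")
    print(f"{'total':24s} {total:10.2f}")
```

Sorting the breakdown by cost makes it easier to see whether compute or a secondary component such as data transfer deserves attention first.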

Architectural Decisions :

How to understand what is at risk and the real impact? Define safe limits, mitigation and automated handling in a controlled environment.
e.g. ELK Architecture

Requirements

Good to have an understanding of or experience with Linux, AWS EC2 instances, AWS pricing, autoscaling, application deployment, production setup and best practices.

Speaker bio

I am Neeraj Prem, DevOps Engineer at Moonfrog, India’s top mobile gaming company. I am responsible for managing a very dynamic infrastructure and its reliability, availability, performance, continuity and security. In my career, I have worked on a variety of challenges related to IT infrastructure, production servers and their automation, solving many interesting real business problems.

Slides

https://docs.google.com/presentation/d/1o1P7g_4nQLbPUePvEqp47cc_871M3Wn9OoLC35_VTdA/edit?usp=sharing

Rootconf 2019