How we brought down costs by 60% without any code change.
Think about systems that perform better yet cost less than usual.
Think about deploying a server infrastructure that is highly scalable, self-managed, easy to manage and customise, and able to serve a huge volume of traffic.
But at the same time it should be very cost-effective, resilient to crashes and glitches, cause no downtime, and require no application code change.
Sounds like an interesting problem statement to solve, right?
The answer is yes. We have a production-qualified solution running that meets all of the above expectations, built simply by using some ready-to-use technologies, and we brought down costs by 60% without any code change or development effort.
As our player base grew and our infrastructure grew with it, we wanted to design an infrastructure that is highly scalable and easy to manage, but at the same time cost-optimal. I am here to walk through the journey in which we achieved this goal and to share our experience and learnings.
- Our Infrastructure : What the infrastructure looks like and how many components it has. Some smart Day-1 infra decisions that helped us achieve our goal. Some advantages, some challenges.
- Know thy constraints : Understand the systems, tech stack, data pipeline, logical isolations, applications and their nature in the production environment, and what can be tuned for the better.
- Availability vs Reliability : Understand how reliable and fault-tolerant your systems are, and evaluate whether they qualify to run on an automated scaling platform.
- Resiliency of Services : Dive deep into the system level, where the system caters to the application and together they interact with the infrastructure to maintain resiliency of services. This brings a system a few steps closer to dealing efficiently with glitches and connection errors while protecting the user experience.
What are all the pieces -
- Right Tools/Technologies in Arsenal : Which tools and technologies we cherry-picked, and how they align with our goal.
- Compute/EC2 : What a unit compute system/instance should look like, how to select one considering cost and performance, and what to keep within the system image.
- Service Deployments : How we managed application builds, build deployment using bootstrap within the instance, checkpoints and adding instances into production, and graceful decommissioning along with flushing.
- Autoscale - Out and In (both) : Understanding traffic patterns and scheduling cluster expansion and shrinking, plus reactive autoscaling based on cluster performance. Hacks to run at the least possible cost without risking anything. Time-based and performance-based: how to leverage “And”, not “Or”.
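One way to picture the “And”, not “Or” idea is a capacity calculation where the time-based schedule sets a floor and the performance-based signal can only raise the cluster above it. The sketch below is illustrative only and is not the talk's actual implementation: `ScheduleWindow`, `desired_capacity`, the 60% CPU target, and the instance limits are all hypothetical names and values chosen for the example.

```python
import math
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class ScheduleWindow:
    """Hypothetical time window with a minimum cluster size (assumed peak hours)."""
    start: time
    end: time
    min_instances: int


def scheduled_floor(now: time, windows: list[ScheduleWindow], default_min: int) -> int:
    """Time-based part: minimum instances expected for the current time of day."""
    for window in windows:
        if window.start <= now < window.end:
            return window.min_instances
    return default_min


def reactive_target(current_instances: int, cpu_percent: float, target_cpu: float) -> int:
    """Performance-based part: size the cluster so average CPU lands near target_cpu."""
    return max(1, math.ceil(current_instances * cpu_percent / target_cpu))


def desired_capacity(
    now: time,
    current_instances: int,
    cpu_percent: float,
    windows: list[ScheduleWindow],
    default_min: int = 2,        # assumed off-peak floor
    max_instances: int = 20,     # assumed hard cost ceiling
    target_cpu: float = 60.0,    # assumed target average CPU utilisation
) -> int:
    """Combine both signals: the schedule is a floor, load can push above it,
    and the result never exceeds the cost ceiling."""
    floor = scheduled_floor(now, windows, default_min)
    reactive = reactive_target(current_instances, cpu_percent, target_cpu)
    return min(max_instances, max(floor, reactive))


# Assumed evening peak between 18:00 and 23:00 needing at least 10 instances.
evening_peak = [ScheduleWindow(time(18, 0), time(23, 0), 10)]
print(desired_capacity(time(19, 0), 10, 30.0, evening_peak))  # schedule floor wins
print(desired_capacity(time(3, 0), 10, 30.0, evening_peak))   # off-peak, load decides
```

With this shape, a quiet evening still keeps the scheduled capacity in place before the expected peak, while off-peak hours shrink to whatever the load actually needs.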
Other cost components to consider while planning - CDN, Object Storage, Data transfer cost, Cross Availability Zone cost etc.
Architectural Decisions : How to understand what is at risk and what the real impact is. Define safe limits, mitigation and automated handling in a controlled environment, e.g. ELK architecture.
It is good to have an understanding of, or experience with, Linux, AWS EC2 instances, AWS pricing, autoscaling, application deployment, production setup and best practices.
I am Neeraj Prem, DevOps Engineer at Moonfrog, India’s top mobile gaming company. I am responsible for managing a very dynamic infrastructure and its reliability, availability, performance, continuity and security. In my career, I have worked on a variety of challenges related to IT infrastructure, production servers and their automation to solve many interesting real business problems.