Cost Optimization: Not my infrastructure, but my architecture is the culprit.

Submitted Oct 21, 2023

Come economic winter, and as an infrastructure engineer your calendar fills up with meetings and calls titled “Optimising Cloud Cost”. I am sure it sounds familiar. Every engineering team prioritises cutting cloud cost, but the approach is often reactive and partial.
Why?
We tend to optimise only what is visible to us and pick the low-hanging fruit. What we often need to address instead are the issues in our architecture. Look into the architecture and consider the cost of other factors, such as treating security as a first-class concern, the cost of scaling, and the cost of over-engineering. If we focus on fixing these, infrastructure cost reduction becomes a by-product. Eventually, it also makes your infrastructure costs more predictable.

This talk focuses on why architecture should not be designed from an ivory tower but grounded in the realities of your business, so that infrastructure cost stays in check. I will touch upon the hidden costs that are often overlooked and explain them with examples and stories. The costs fall into two categories: direct cost and indirect cost.

Direct cost:

Cost of optimising for scale, used by none: As engineers, we all want to solve for scale. We build and optimise for scale with zero paying customers, and keep adding components to a fantastic architecture that is, sadly, used by none.
Cost of not understanding the workload: Over-provisioning, or auto-scaling horizontally or vertically, without data points about the workload to back the decision.
Cost of no signals, but all noise: Just because we collect metrics, traces, and logs does not mean we will ever use them. Examples:
- Sending metrics with high cardinality does not improve your observability, but it does increase your cost.
- No guard rails on your central logging infrastructure, which increases your storage, compute, and network costs.
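
To make the cardinality point concrete, here is a minimal sketch of how the number of time series for a single metric grows with its labels. The label names and value counts are hypothetical, and the product is an upper bound that assumes every label combination actually occurs.

```python
from math import prod
from typing import Mapping


def series_count(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on distinct time series for one metric: the product of
    the number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-latency metric.
bounded = {"service": 20, "status_code": 5, "region": 3}
# Same metric with one unbounded label added (assumed 10,000 users).
with_user_id = {**bounded, "user_id": 10_000}

print(series_count(bounded))       # 300
print(series_count(with_user_id))  # 3000000
```

One extra label multiplies the series count by its number of distinct values, which is why a single unbounded label can dominate what a metrics backend has to store.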

Indirect cost:

Cost of no collaboration: When product engineering teams and infrastructure teams do not collaborate, architectures get built in silos.
Cost of shiny tool syndrome: Introducing a “shiny new database” is exciting because a cool company used it to solve a problem they hit at a specific scale. Your infrastructure cost will undoubtedly increase, and the engineering effort required will be massive for a minimal gain.
Cost of overlooking security and compliance: After all the engineering effort, the product runs on infrastructure that does not follow good practices. Examples:
- Running components in the public network.
- No VPN for internal tools such as logging infrastructure or self-hosted CI/CD.
- Secret keys spread all across the application.
The cost is your reputation, which trickles down to your sales team as an inability to convert leads. There is also the cost of planning and carrying out the move of your stateless/stateful components.
Cost of heterogeneity: Multiple ways of doing one thing can coexist. The costs to consider include maintenance and vendor lock-in, which can be hard to quantify. Example:
- Running similar workloads both on Kubernetes and as serverless functions on the cloud.

Why should you attend this talk?

If you are a product engineer or an infrastructure/platform engineer, this talk should help. Some of the key takeaways are:

Understand the hidden factors behind cloud cost optimization, and how to treat it as a continuous activity instead of a one-time effort.
Cloud cost optimization cannot be done in isolation. It is a joint effort between product engineers and infrastructure engineers. More empathy across teams :-).
Guidelines that can help you decide between self-managed and hosted solutions.

SRE Conf 2023
