Optimizing costs of cloud infrastructures

Practical case studies from enterprises and startups

  • Tips and tricks to optimize cloud costs - and how to stay optimized by design

    Zainab Bawa

    @zainabbawa

    Pravanjan Choudhury, CTO at Capillary Technologies and co-founder at Facets.Cloud, presented on how to stay optimized by design on 26 November. This session followed his previous presentation on mental models for thinking about cloud costs.
    In this session, Pravanjan shares tips and tricks for optimization, and explains how to design your systems so that they stay optimized. These include:

    1. Tagging
    2. How to use different instruments of cost visibility in AWS effectively
    3. Efficient use of machine reservations
    4. Tying costs with workloads

    Let’s dive deeper into each of these suggestions. (You can skip to the section that is most relevant for your needs.)

    1. Tagging

    • Cloud resources need to be tagged much more intentionally. The goal is to know who the end consumers of each resource are. Tagging works by breaking resources down for decentralized attribution.
    • Split the products, or split your AWS account into separate accounts for non-production and production, to make attribution easier. This removes the need for excessive tags and also supports other cloud best practices, such as better security isolation. A cost-optimization setup that is decentralized by design will have all the tagging inside your infrastructure as code (a minimal tagging sketch follows this list).
    • Where AWS tagging is not enough, for example in a shared Kubernetes cluster, create a common cost-share tag. Wherever you cannot attribute costs directly because multiple workloads share the same underlying platform, label them in a way that lets you attribute utilization and then apportion the shared cost.
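
    As a rough illustration of intentional, attribution-friendly tagging, here is a minimal Python sketch using boto3. The tag keys (`team`, `product`, `environment`) and the instance ID are assumptions for illustration, not part of the talk; in a setup that is decentralized by design, the same tags would live in your infrastructure-as-code templates rather than in a script.

    ```python
    # Minimal sketch: tagging EC2 resources so costs can be attributed per team/product.
    # The tag keys and values are illustrative; use whatever taxonomy your organization
    # has agreed on, and keep the same tags in your IaC templates.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    ATTRIBUTION_TAGS = [
        {"Key": "team", "Value": "payments"},
        {"Key": "product", "Value": "checkout"},
        {"Key": "environment", "Value": "production"},
    ]

    def tag_resources(resource_ids):
        """Apply the attribution tags to a list of EC2 resource IDs (instances, volumes, ...)."""
        ec2.create_tags(Resources=resource_ids, Tags=ATTRIBUTION_TAGS)

    if __name__ == "__main__":
        tag_resources(["i-0123456789abcdef0"])  # hypothetical instance ID
    ```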

    2. Using different instruments of cost visibility in AWS

    • Using AWS Cost Categories, break down your costs to detect a baseline and anomalies for each team or product group. AWS Cost Categories lets you create dimensions that are business-centric rather than asset-centric. Create rules for each of these dimensions, and then add multiple rules as tags.
    • If you also have shared tags between products, pull the data out of Cost Explorer, or use the raw AWS detailed line items, and tie it up with utilization data from Prometheus or Grafana to build the tooling yourself.
    • Whenever an anomaly happens, the dimension in which it is happening may not be captured. In such cases, it helps to break dimensions down both asset-wise and as combinations of business metrics, and to keep an "unclassified" category as well (a cost-grouping sketch follows this list).
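
    If you build the tooling yourself, a starting point might look like the following sketch, which pulls cost per attribution tag from the Cost Explorer API so it can be joined with utilization data from Prometheus or Grafana. The tag key `team` and the date range are assumptions for illustration.

    ```python
    # Minimal sketch: month-to-date cost per "team" tag from the Cost Explorer API.
    import boto3

    ce = boto3.client("ce", region_name="us-east-1")

    def cost_by_team(start: str, end: str):
        """Return {team: cost_in_usd} for the given date range (YYYY-MM-DD strings)."""
        response = ce.get_cost_and_usage(
            TimePeriod={"Start": start, "End": end},
            Granularity="MONTHLY",
            Metrics=["UnblendedCost"],
            GroupBy=[{"Type": "TAG", "Key": "team"}],
        )
        costs = {}
        for result in response["ResultsByTime"]:
            for group in result["Groups"]:
                team = group["Keys"][0]  # e.g. "team$payments", or "team$" when untagged
                amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
                costs[team] = costs.get(team, 0.0) + amount
        return costs

    if __name__ == "__main__":
        print(cost_by_team("2022-11-01", "2022-12-01"))
    ```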

    3. Machine reservations - discount versus optionality versus liability

    • AWS reservations have taken different forms: earlier these were machine reservations, and now there are also Savings Plans. A useful mental model of a reservation is a combination of discount, commitment, and optionality. You need to know what your future costs and projections will look like. Based on this estimate, commit for one to three years. At the same time, do not be too conservative; otherwise you will lose out on the reservation discounts.
    • A Savings Plan is simpler because the commitment moves from machines to dollars per hour. This also means that your returns diminish as you commit more. Use Cost Explorer to sanity-check the number suggested by the AWS recommendation engine, and then arrive at a figure for the residual Savings Plan commitment (a back-of-the-envelope sketch follows).
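
    The commitment trade-off can be reasoned about with simple arithmetic. The sketch below is a back-of-the-envelope estimate only; the rates, the assumed 30% discount, and the utilization figure are illustrative assumptions, and the real numbers should come from Cost Explorer and the AWS recommendation engine.

    ```python
    # Back-of-the-envelope sketch of a Savings Plan commitment. All numbers here are
    # illustrative assumptions; use Cost Explorer data for real figures.

    on_demand_rate = 10.0   # USD/hour currently spent on-demand (baseline)
    steady_fraction = 0.8   # fraction of the baseline expected to persist for 1-3 years
    discount = 0.30         # assumed Savings Plan discount vs. on-demand pricing

    commitment = on_demand_rate * steady_fraction      # USD/hour committed
    committed_cost = commitment * (1 - discount)       # paid for the committed portion
    residual_on_demand = on_demand_rate - commitment   # remainder stays on-demand

    total_with_plan = committed_cost + residual_on_demand
    savings = on_demand_rate - total_with_plan

    print(f"Commit {commitment:.2f} USD/hour -> effective spend {total_with_plan:.2f} USD/hour")
    print(f"Savings vs. pure on-demand: {savings:.2f} USD/hour ({savings / on_demand_rate:.0%})")

    # Note: if actual usage falls below the commitment, the unused committed dollars are
    # still billed, which is why over-committing turns the discount into a liability.
    ```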

    4. Tying costs with workloads

    • Real-time workload:

      • If your stateless systems are cloud-ready, use Spot for the minimum workload that is required.
      • Auto scaling keeps the costs down where the web traffic can change significantly.
      • Stateful sets might cause disruptions or require time to reboot. Measuring disk I/O utilization can be helpful.
      • For high-availability setups, be conscious of cross-Availability-Zone transfer costs and latency.
    • Batch workloads:

      • You can use Spot heavily (a minimal Spot request sketch follows this section). But since such jobs are time-bound, you shouldn't use Spot for a workload that has to finish in a few hours.
      • It is best to use a managed system like Elastic MapReduce or Databricks for ETL.
      • There are multiple instance options that are memory heavy. Use storage tiers, such as those on S3, to get cost optimization benefits when storing large amounts of data.
    • ML workload:

      • Spot can be used heavily for Machine Learning workloads.
      • Managed systems are a lot better, as you need to pack a lot of things into the Amazon Machine Images and you can build tooling on top of them.
      • The multiple-instance option is not always applicable here, because ML workloads can sometimes be CPU hungry.
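
    For interruption-tolerant batch or ML jobs, Spot capacity can be requested directly. The sketch below is a minimal illustration using boto3; the AMI ID, instance type, max price, and tags are placeholder assumptions, and most teams would run such jobs through an autoscaling group, EMR, or a Kubernetes cluster rather than launching instances one by one.

    ```python
    # Minimal sketch: launching a Spot instance for an interruption-tolerant batch job.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def launch_spot_worker():
        response = ec2.run_instances(
            ImageId="ami-0123456789abcdef0",   # hypothetical AMI baked with the batch job
            InstanceType="m5.xlarge",
            MinCount=1,
            MaxCount=1,
            InstanceMarketOptions={
                "MarketType": "spot",
                "SpotOptions": {
                    "MaxPrice": "0.10",                      # USD/hour cap (assumption)
                    "SpotInstanceType": "one-time",
                    "InstanceInterruptionBehavior": "terminate",
                },
            },
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "team", "Value": "data-eng"},
                         {"Key": "workload", "Value": "nightly-etl"}],
            }],
        )
        return response["Instances"][0]["InstanceId"]

    if __name__ == "__main__":
        print("Launched spot worker:", launch_spot_worker())
    ```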

    In conclusion, Pravanjan discussed how to manage data transfer costs across multiple EKS clusters. He suggested first capturing the application intent. If there are services and associated databases that can recover from a failure, pin them to a particular region. This way, the dependencies of that service and its associated ones are limited to a single region.

    This summary is compiled by Anwesha Sen, with review by Pravanjan Choudhury, and edits by Prashant Dhanke and Zainab Bawa.
