SRE Conf 2023

Availability and reliability 24/7- the SRE life

Preeti

@preetidewani

Deep dive into analyzing high cardinality metrics

Submitted Sep 30, 2023

Introduction

Metrics are the fundamental unit of time series databases (TSDBs). Each metric carries labels that denote dimensions, e.g. http_status_code, url, etc. Critical insights require metrics to have both breadth (a large number of labels) and depth (each label having a large number of unique values). Together, these determine metric cardinality: the number of unique label-value combinations, each of which is stored as a separate time series. Higher cardinality == deeper insights. This talk assumes Prometheus-like systems for reference.
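To make the breadth/depth idea concrete, here is a minimal sketch that counts cardinality for a handful of in-memory series. The sample data and the helper names (`label_cardinality`, `metric_cardinality`) are made up for illustration and are not part of Prometheus or of the tool discussed in this talk.

```python
from typing import Dict, List

Series = Dict[str, str]  # one time series = one set of label/value pairs


def label_cardinality(series: List[Series]) -> Dict[str, int]:
    """Depth: number of unique values seen for each label."""
    values: Dict[str, set] = {}
    for s in series:
        for name, value in s.items():
            values.setdefault(name, set()).add(value)
    return {name: len(vals) for name, vals in values.items()}


def metric_cardinality(series: List[Series]) -> int:
    """Number of unique label-value combinations, i.e. distinct series."""
    return len({frozenset(s.items()) for s in series})


# Hypothetical series for a single metric, using the labels named above.
http_requests = [
    {"http_status_code": "200", "url": "/login"},
    {"http_status_code": "200", "url": "/cart"},
    {"http_status_code": "500", "url": "/login"},
    {"http_status_code": "404", "url": "/cart"},
]

print(label_cardinality(http_requests))   # {'http_status_code': 3, 'url': 2}
print(metric_cardinality(http_requests))  # 4
```

The product of the per-label counts (3 x 2 = 6 here) is only an upper bound. The actual cardinality is the number of combinations that really occur, which is 4 in this sample.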

Problem

In today’s world of microservices, distributed services with dimensions like tenant, region, and service invariably lead to high cardinality. Unchecked cardinality growth can lead to adverse effects such as higher resource consumption, slow-loading dashboards, failing alert queries, observability systems going blank, reduced retention time, etc.

Avoiding these situations at enterprise scale requires understanding what is causing high cardinality and how to manage it. This is often an afterthought, and the tooling needed to answer the following questions gets ignored:

  1. How do you find your TSDB's cardinality limits?
  2. How do you know when your system is approaching these limits?
  3. When it does approach them, how do you find out which metrics are contributing?
  4. How do you dissect these metrics to find the labels that led to the cardinality explosion?
  5. What actions need to be taken to fix this?
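Questions 2 and 3 can be sketched as a simple check: count the active series per metric and compare the total against a limit. Everything below is an assumption made for illustration. The limit of 12 series, the 80% warning ratio, the sample data, and the helper names are hypothetical. Only the `__name__` label follows the Prometheus convention for storing the metric name.

```python
from collections import Counter
from typing import Dict, List


def series_per_metric(series: List[Dict[str, str]]) -> Counter:
    """Count series per metric name (stored in the __name__ label)."""
    return Counter(s["__name__"] for s in series)


def limit_report(series: List[Dict[str, str]], limit: int,
                 warn_ratio: float = 0.8) -> dict:
    """Summarise how close the total series count is to a given limit.

    Assumes the input holds each series once (no duplicates).
    """
    counts = series_per_metric(series)
    total = sum(counts.values())
    return {
        "total": total,
        "usage": total / limit,
        "approaching_limit": total >= warn_ratio * limit,
        "top_metrics": counts.most_common(3),  # biggest contributors first
    }


# Hypothetical data: 3 tenants x 3 regions = 9 series, plus one "up" series.
sample = [
    {"__name__": "http_requests_total", "tenant": t, "region": r}
    for t in ("a", "b", "c")
    for r in ("us", "eu", "ap")
] + [{"__name__": "up", "job": "api"}]

report = limit_report(sample, limit=12)
print(report["total"], report["approaching_limit"])  # 10 True
print(report["top_metrics"])  # [('http_requests_total', 9), ('up', 1)]
```

A real limit depends on the TSDB and the hardware it runs on, so the number used here is only a placeholder.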

Solution

The default solutions revolve around finding labels with high cardinality and dropping them. But I have seen in customer production environments that blindly dropping labels gives a false sense of having fixed the problem without actually solving anything. I wrote a tool that analyses high cardinality systems and extracts insights to answer these questions:

  1. What is the overall state of my system - are there any metrics approaching limits?
  2. Which labels of which metrics are likely to cause a cardinality explosion?
  3. If you choose to drop/aggregate these labels - what will the end state look like?
  4. Why dropping labels should not be the default solution, e.g. there are corner cases where dropping the label with the highest cardinality yields zero reduction.
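The corner case in point 4 can be reproduced with a small "what if" simulation. This is a sketch on hypothetical data and is not the tool described in this talk. The helpers `series_after_drop` and `what_if` are illustrative names. The sample is built so that the highest-cardinality label (`url`) is fully determined by the other two labels.

```python
from typing import Dict, List, Set


def series_after_drop(series: List[Dict[str, str]], drop: Set[str]) -> int:
    """Number of distinct series left once the given labels are removed."""
    return len({
        frozenset((k, v) for k, v in s.items() if k not in drop)
        for s in series
    })


def what_if(series: List[Dict[str, str]]) -> Dict[str, int]:
    """Resulting series count for dropping each label on its own."""
    labels = sorted({k for s in series for k in s})
    return {label: series_after_drop(series, {label}) for label in labels}


# Hypothetical metric: url has 4 unique values, service and endpoint 2 each,
# but every url maps to exactly one (service, endpoint) pair.
series = [
    {"service": "cart", "endpoint": "add", "url": "/cart/add"},
    {"service": "cart", "endpoint": "list", "url": "/cart/list"},
    {"service": "user", "endpoint": "add", "url": "/user/add"},
    {"service": "user", "endpoint": "list", "url": "/user/list"},
]

print(series_after_drop(series, set()))                # 4 (current state)
print(what_if(series))  # {'endpoint': 4, 'service': 4, 'url': 4}
print(series_after_drop(series, {"url", "endpoint"}))  # 2
```

Dropping `url` alone leaves all 4 series in place, because `service` and `endpoint` together still tell them apart. The count only falls once a correlated label is dropped with it, which is why simulating the end state is more informative than ranking labels by unique values.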

Benefits

The audience of this talk will have the following takeaways:

  1. Fundamental knowledge to question assumptions when approaching systems with high cardinality metrics.
  2. A handy open source cardinality debugger/explorer, well tested in customer production environments, to analyze your Prometheus-like TSDB systems and get the right numbers upfront, which will help you choose where to invest your time - updating your instrumentation code, changing your metric agent configuration, etc.

Comments

  • Sitaram Shelke (@sitaram, Editor)

    Hello Preeti
    Thanks for the detailed writeup.
    The topic would suit the scaling and performance track nicely.

    1. Can you please include links to the tool you mentioned?
    2. Also, please mention the preliminary knowledge the audience needs to follow the talk.
    3. Will there be a live/recorded demo of the tool you mention?
      I think it would be important for the audience to learn something more than they could by visiting the tool's documentation and applying the technique as is.
    Posted 1 year ago