Deep dive into analyzing high cardinality metrics

This submission has been added to the schedule

Submitted Sep 30, 2023

Introduction

Metrics are the fundamental unit of a time series database (TSDB). Each metric carries labels that denote dimensions, e.g. http_status_code, url, etc. Critical insights require metrics to have both breadth (a large number of labels) and depth (each label having a large number of unique values). Together these determine a metric's cardinality: the number of distinct label-value combinations, each of which is stored as a separate time series. Higher cardinality means deeper insights. This talk assumes Prometheus-like systems for reference.
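
To make breadth and depth concrete, here is a minimal Python sketch that counts the cardinality of one metric and compares it with the worst case implied by its labels. The series data and the function names (cardinality, upper_bound) are made up for illustration and are not taken from the tool presented in the talk.

```python
from math import prod

# Hypothetical series of a single metric, each represented by its label set.
series = [
    {"http_status_code": "200", "url": "/login"},
    {"http_status_code": "200", "url": "/cart"},
    {"http_status_code": "500", "url": "/cart"},
]


def cardinality(series):
    """Number of distinct label-value combinations, i.e. distinct series."""
    return len({frozenset(s.items()) for s in series})


def upper_bound(series):
    """Product of unique values per label: the worst-case series count."""
    values = {}
    for s in series:
        for name, value in s.items():
            values.setdefault(name, set()).add(value)
    return prod(len(v) for v in values.values())


print(cardinality(series))  # 3
print(upper_bound(series))  # 4
```

The gap between the two numbers matters: adding one more label with N unique values can multiply the series count by up to N, even if today's data has not reached that bound.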

Problem

In today’s world of microservices, dimensions like tenant, region, and service invariably lead to high cardinality. Unchecked cardinality growth has adverse effects: higher resource consumption, slow-loading dashboards, failing alerting queries, observability systems going blank, reduced retention time, etc.

Avoiding these situations at enterprise scale requires understanding what is causing high cardinality and how to manage it. This is often treated as an afterthought, and the tooling needed to answer these questions is neglected:

How do you find your TSDB's cardinality limits?
How do you know when your system is approaching these limits?
When it does approach them, how do you find out which metrics are contributing?
How do you dissect these metrics to find the labels that led to the cardinality explosion?
What actions need to be taken to fix this?
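
As a rough sketch of the second and third questions, the Python snippet below counts series per metric name and reports the metrics at or above a chosen fraction of a limit. The limit, the threshold, the sample data, and the function names are all assumptions for illustration; real limits depend on your TSDB and its configuration, and this is not the tool described in the talk.

```python
from collections import Counter


def series_per_metric(series):
    """Count series per metric name (Prometheus keeps it in __name__)."""
    return Counter(s["__name__"] for s in series)


def metrics_near_limit(series, limit, threshold=0.8):
    """Return (metric, series_count) pairs at or above threshold * limit,
    largest first. Both limit and threshold are hypothetical knobs."""
    counts = series_per_metric(series)
    flagged = [(name, n) for name, n in counts.items() if n >= threshold * limit]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical snapshot of active series: 9 for one metric, 2 for another.
snapshot = [
    {"__name__": "http_requests_total", "url": f"/page/{i}"} for i in range(9)
] + [
    {"__name__": "up", "job": job} for job in ("api", "db")
]

print(metrics_near_limit(snapshot, limit=10))
# [('http_requests_total', 9)]
```

Running such a check on a schedule gives an early warning before a limit is hit, and the per-metric counts point to where the label-level analysis should start.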

Solution

The default solutions revolve around finding labels with high cardinality and dropping them. But I have seen in customer production environments that blindly dropping labels gives a false sense of having fixed the problem without actually solving anything. I wrote a tool that analyses high cardinality systems and extracts insights to answer these questions:

What is the overall state of my system - are there any metrics approaching limits?
Which labels of which metrics are likely to cause a cardinality explosion?
If you choose to drop/aggregate these labels - what will the end state look like?
Why should dropping labels not be the default solution? E.g. there are corner cases where dropping the label with the highest cardinality yields zero reduction.
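
The corner case in the last question can be reproduced with a few lines of Python. In this hypothetical metric, the instance label has the most unique values, but it is fully determined by host and port, so dropping it alone merges no series. The data and the helper names (unique_values, series_after_drop) are illustrative assumptions, not the talk's tool.

```python
def unique_values(series, label):
    """Number of unique values a label takes across the series."""
    return len({s[label] for s in series if label in s})


def series_after_drop(series, dropped):
    """Series count that remains once the labels in `dropped` are removed."""
    return len({
        frozenset((k, v) for k, v in s.items() if k not in dropped)
        for s in series
    })


# Hypothetical metric: instance is just "host:port", so it is redundant.
series = [
    {"instance": f"{host}:{port}", "host": host, "port": port}
    for host in ("a", "b", "c")
    for port in ("8080", "9090")
]

for label in ("instance", "host", "port"):
    print(label, unique_values(series, label))
# instance 6
# host 3
# port 2

print(series_after_drop(series, {"instance"}))          # 6, no reduction
print(series_after_drop(series, {"instance", "host"}))  # 2
```

Simulating the end state before changing any configuration shows that per-label unique counts alone are a poor guide: what matters is how many series collapse once the label is gone.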

Benefits

The audience of this talk will have the following takeaways:

Fundamental knowledge to question assumptions when approaching systems with high cardinality metrics.
A handy open source cardinality debugger/explorer, well tested in customer production environments, to analyze your Prometheus-like TSDB systems and put the right numbers in front of you, which will help you choose where to invest your time - updating your instrumentation code, changing your metric agent configuration, etc.

SRE Conf 2023
