Metrics are the fundamental unit of time series databases (TSDBs). They consist of labels that denote dimensions, e.g. http_status_code, url, etc. Critical insights require metrics to have both breadth (a large number of labels) and depth (a large number of unique values per label). Together, breadth and depth make up metric cardinality. Higher cardinality == deeper insights. This talk assumes Prometheus-like systems for reference.
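To make the breadth-times-depth point concrete, here is a minimal sketch (not part of the proposal or the tool) of how per-label value counts multiply into series cardinality; the metric, label names, and counts are hypothetical:

```python
# Minimal sketch: how label breadth and depth multiply into series
# cardinality for a single metric. Labels and counts are hypothetical.
from math import prod

# Unique values observed per label on http_requests_total (assumed numbers)
label_values = {
    "http_status_code": 8,   # 200, 201, 301, 400, 401, 403, 404, 500
    "url": 500,              # distinct request paths
    "region": 6,
    "tenant": 1200,
}

# Worst case: every combination of label values becomes its own time series.
worst_case_series = prod(label_values.values())
print(f"worst-case series for this metric: {worst_case_series:,}")
# 8 * 500 * 6 * 1200 = 28,800,000 potential series from just four labels
```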
In today’s world of microservices, distributed services with dimensions like tenant, region, and service invariably lead to high cardinality. Unchecked cardinality growth can have adverse effects: higher resource consumption, slow-loading dashboards, failing alerting queries, observability systems going blank, reduced retention time, etc.
Avoiding these situations at enterprise scale requires understanding what is causing high cardinality and how to manage it. This is often an afterthought, and the tooling needed to answer these questions gets ignored:
- How do you find your TSDB's cardinality limits?
- How do you know when your system is approaching these limits?
- When it does approach them, how do you find out which metrics are contributing?
- How do you dissect these metrics to find the labels that led to the cardinality explosion?
- What actions need to be taken to fix this?
The default solutions hover around finding labels with high cardinality and dropping them. But I have seen that in customer production environments, blindly dropping labels gives a false sense of having fixed the problem without actually solving anything. I wrote a tool to analyse high cardinality systems and extract insights that answer these questions:
- What is the overall state of my system - are there any metrics approaching limits?
- Which labels of which metrics are likely to cause a cardinality explosion?
- If you choose to drop/aggregate these labels - what will the end state look like?
- Why dropping labels should not be the default solution, e.g. there are corner cases where dropping the label with the highest cardinality has zero impact on reduction (see the sketch after this list).
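As a toy illustration of that corner case (hypothetical labels, not taken from the tool): if the highest-cardinality label is fully redundant with the remaining labels, dropping it leaves the series count unchanged.

```python
# Hypothetical data: `instance` has the most unique values, but it is fully
# determined by (pod, port), so every series stays distinct after the drop.
series = {
    ("pod-a", "8080", "pod-a:8080"),
    ("pod-a", "9090", "pod-a:9090"),
    ("pod-b", "8080", "pod-b:8080"),
    ("pod-b", "9090", "pod-b:9090"),
}  # label order: (pod, port, instance)

def drop_label(series_set, index):
    """Simulate dropping one label and count the series that remain."""
    return {tuple(v for i, v in enumerate(s) if i != index) for s in series_set}

print(len(series))                  # 4 series before
print(len(drop_label(series, 2)))   # still 4 after dropping `instance`
```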
The audience of this talk will have the following takeaways:
- Fundamental knowledge to question assumptions when approaching systems with high-cardinality metrics.
- A handy open source cardinality debugger/explorer, well tested in customer production environments, to analyze your Prometheus-like TSDB and get the right numbers upfront. These numbers will help you choose where to invest your time: updating your instrumentation code, changing your metric agent configuration, etc.
Zainab Bawa
@zainabbawa Editor & Promoter
Hi Preeti, summarizing the feedback from the editors:
High cardinality metrics are a problem mostly in environments that are on the path of scaling. This will be a value-add talk, with practical applications for the audience.
The talk, though from Last9.io, is not about Last9's offerings but about an open source tool developed by them. Seems very interesting and useful.
Preeti
@preetidewani Submitter
Hi Sitaram Shelke,
Thank you for getting back to me.
Can you please help me with the next steps?
Sitaram Shelke
@sitaram Editor
Hi Preeti,
Please update here once you have prepared the document. I'd suggest not stressing too much over polishing, as we are still reviewing the proposal at this stage.
Sitaram Shelke
@sitaram Editor
Hello Preeti
Thanks for the detailed writeup.
The topic would fit nicely into the scaling and performance track.
I think it would be important for the audience to learn something more than what they could get by visiting the tool's documentation and applying the technique as is.