Cortex: Horizontally Scalable, Distributed Prometheus
Submitted by Goutham Veeramachaneni (@gouthamve) on Thursday, 14 February 2019
Technical level: Intermediate
Status: Rejected
In this talk we’ll present Cortex, a horizontally scalable, distributed monitoring system that is compatible with the Prometheus API. Cortex was built to offer a different approach to Prometheus HA and to provide virtually infinite retention. We’ll discuss its architecture, tradeoffs and evolution, with particular attention to the distributed systems algorithms used to provide failure tolerance and scalability.
Cortex turns many of Prometheus’ architectural assumptions on their head by marrying a scale-out PromQL query engine with a storage layer based on NoSQL databases such as Bigtable, DynamoDB and Cassandra. We have disaggregated the Prometheus binary into a microservices-style architecture, with separate services for query, ingest, alerting and recording rules. Because all of these services are designed as fungible replicas, the system can be scaled out with ease, and the failure of any individual replica can be handled gracefully.
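To make the idea of fungible ingest replicas concrete, here is a minimal sketch of one common technique for spreading series across interchangeable replicas: a consistent hash ring with replication. It is an illustration under stated assumptions, not Cortex's actual implementation; the `Ring` class, the `owners` method, the token count, and the `ingester-N` names are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 32-bit hash so placement is deterministic across runs.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


class Ring:
    """Toy consistent hash ring mapping series to ingest replicas (hypothetical)."""

    def __init__(self, replicas, tokens_per_replica=64):
        # Each replica owns several tokens so load spreads evenly.
        self._tokens = sorted(
            (_hash(f"{name}-{i}"), name)
            for name in replicas
            for i in range(tokens_per_replica)
        )
        self._keys = [token for token, _ in self._tokens]

    def owners(self, series, replication_factor=3, healthy=None):
        """Walk clockwise from the series hash, collecting distinct healthy replicas."""
        start = bisect.bisect_right(self._keys, _hash(series))
        found = []
        for offset in range(len(self._tokens)):
            _, name = self._tokens[(start + offset) % len(self._tokens)]
            if name in found or (healthy is not None and name not in healthy):
                continue
            found.append(name)
            if len(found) == replication_factor:
                break
        return found


replicas = [f"ingester-{i}" for i in range(1, 5)]  # assumed replica names
ring = Ring(replicas)
series = 'http_requests_total{job="api"}'

before = ring.owners(series)
# Simulate the first owner failing: the next replica on the ring takes over.
after = ring.owners(series, healthy=set(replicas) - {before[0]})
```

In this sketch, losing one replica only shifts the affected series to the next replica on the ring; the other owners keep their data, which is what makes individual failures cheap to absorb.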
Cortex is a CNCF project that has been in production for over two years, and the talk will cover some of the many lessons we have learnt along the way.
This talk will help the audience understand what Cortex is, how it relates to Prometheus and how to get started with it. The lack of horizontal scalability, replication and long-term storage has been cited by some as a reason not to choose Prometheus; Cortex aims to provide a version of Prometheus with these features, removing some of the arguments against adoption.
We start with Prometheus and its limitations, then dive into the motivations and architecture behind Cortex. We then cover users and use cases before closing with the future of the project.
Goutham is a developer from India who started his journey as an infra intern at a large company, where he worked on deploying Prometheus. After that initial encounter, he started contributing to Prometheus and interned with CoreOS, working on Prometheus’ new storage engine. He is now a maintainer of TSDB, the engine behind Prometheus 2.0, and works at Grafana Labs on open-source observability tools. When not hacking away, he is either on his bike or binge-watching GCN!