10x faster query performance with Jaeger, Prometheus and Correlation!
We hack on Cortex, an open-source CNCF project for distributed Prometheus, and run it in production. But as we scaled, we noticed poor query performance. We found ourselves adding new metrics on each rollout to test our theories, often shooting in the dark, only to have our assumptions invalidated after a lot of experimentation. Then we turned to Jaeger, and things instantly improved.
In this talk, we will introduce you to distributed tracing, show you how we implemented tracing with Jaeger, and how we correlated the data with Prometheus metrics and logs to understand which subsystems, services, and function calls were slow, then optimised them to achieve 10x latency improvements. We will also talk about the issues we faced scaling Jaeger in a multi-cluster scenario, and how we used Envoy to solve the problem of dropped spans. We will share the jaeger-mixin we developed to monitor Jaeger itself, and talk about how we are evangelising Jaeger internally to other teams and onboarding them.
This talk will take the audience through the entire journey: the value that Jaeger and distributed tracing can add, the scaling problems they may hit, and how to evangelise Jaeger internally at their company.
This talk expands on https://bit.ly/2YNMWRJ (the official Jaeger blog post about how Grafana Labs uses Jaeger). We will walk through the latency issues we were facing in Cortex and how we leveraged Jaeger to solve them. Along the way, we will also show how it interplays with our Prometheus and logging setups, and how Jaeger fits right into the workflow. With Prometheus, we have RED dashboards that highlight which services are slow; we then use Jaeger to drill down and investigate how long each function takes, whether we're making too many calls, whether we could batch calls together, etc. After rolling out the changes, we use Jaeger and Prometheus again to verify their impact.
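The RED (Rate, Errors, Duration) dashboards mentioned above can be sketched with PromQL queries along these lines; the metric and label names here are illustrative assumptions, not our exact setup:

```promql
# Rate: requests per second, per route
sum(rate(request_duration_seconds_count[5m])) by (route)

# Errors: fraction of requests failing, per route
sum(rate(request_duration_seconds_count{status_code=~"5.."}[5m])) by (route)
  / sum(rate(request_duration_seconds_count[5m])) by (route)

# Duration: 99th-percentile latency, per route
histogram_quantile(0.99,
  sum(rate(request_duration_seconds_bucket[5m])) by (le, route))
```

When one route stands out on these panels, that's the cue to open Jaeger and inspect traces for that route.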
We were also seeing very slow queries once in a while that weren't obvious from the dashboards. We then started logging the traceID in our request logs: we picked the requests that took too long and used their traceIDs to probe for issues with those particular requests. This required tracing every single query fired, which led to scaling issues with Jaeger. We were able to resolve these by using the new gRPC support in Jaeger coupled with Envoy for load balancing. After deriving all this value from Jaeger, we started onboarding other teams at Grafana, and we will share the challenges we faced and how we fixed them.
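A minimal sketch of the traceID-in-request-logs idea, using only the Go standard library. It reads Jaeger's `uber-trace-id` propagation header directly rather than going through a tracer client; the route name and log format are illustrative assumptions, not our production code:

```go
package main

import (
	"log"
	"net/http"
	"strings"
	"time"
)

// traceIDFromHeader extracts the trace ID from a Jaeger "uber-trace-id"
// header value, which has the form "<trace-id>:<span-id>:<parent-id>:<flags>".
func traceIDFromHeader(h string) string {
	if h == "" {
		return "none"
	}
	return strings.SplitN(h, ":", 2)[0]
}

// withTraceLogging wraps a handler so every request log line carries the
// request duration and the trace ID, letting us grep for slow requests and
// paste their IDs straight into the Jaeger UI.
func withTraceLogging(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		traceID := traceIDFromHeader(r.Header.Get("Uber-Trace-Id"))
		log.Printf("path=%s duration=%s traceID=%s", r.URL.Path, time.Since(start), traceID)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/query", withTraceLogging(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})))
	// In production this mux would be served, e.g. log.Fatal(http.ListenAndServe(":8080", mux)).
	log.Println("example trace ID:", traceIDFromHeader("6e0c63257de34c92:1:0:1"))
}
```

The payoff is the correlation: a `duration=…` outlier in the logs comes with a traceID you can look up in Jaeger immediately, instead of guessing which dashboard panel to check.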
Goutham is a developer from India who started his journey as an infra intern at a large company, where he worked on deploying Prometheus. After that initial encounter, he started contributing to Prometheus and interned with CoreOS, working on Prometheus' new storage engine. He is now a maintainer of TSDB, the storage engine behind Prometheus 2.0. He works at Grafana Labs on open-source observability tools. When not hacking away, he is on his bike adding miles and hurting his bum.