Rootconf Pune edition

On security, network engineering and distributed systems

10x faster query performance with Jaeger, Prometheus and Correlation!

Submitted by Goutham Veeramachaneni (@gouthamve) on Monday, 15 July 2019

Section: Full talk (40 mins) Category: Distributed systems Status: Rejected

Abstract

We hack on Cortex, an open-source CNCF project for distributed Prometheus, and run it in production. As we scaled it up, we noticed poor query performance. We found ourselves adding new metrics on each rollout to test our theories, often shooting in the dark, only to have our assumptions invalidated after a lot of experimentation. We then decided to turn to Jaeger, and things instantly improved.

In this talk, we will introduce you to distributed tracing, show you how we implemented tracing with Jaeger, and explain how we correlated the trace data with Prometheus metrics and logs to understand which subsystems, services and function calls were slow. We then optimised those to achieve 10x latency improvements. We will also talk about the issues we faced scaling Jaeger in a multi-cluster scenario, and how we used Envoy to solve the problem of dropped spans. We will share the jaeger-mixin we developed to monitor Jaeger, and talk about how we are evangelising Jaeger internally and onboarding other teams.

Outline

This talk will take the audience through the entire journey: the value that Jaeger and distributed tracing can add, the scaling problems they could hit, and how to evangelise Jaeger within their company.

This talk expands on https://bit.ly/2YNMWRJ (the official Jaeger blog post about how Grafana Labs uses Jaeger). We will walk through the latency issues we were facing in Cortex and how we leveraged Jaeger to solve them. Along the way, we will also show how Jaeger interacted with our Prometheus and logging setups and how it fits right into the workflow. With Prometheus, we have our RED dashboards that highlight which services are slow. We then use Jaeger to drill down and investigate which functions are taking how long, whether we are making too many calls, whether we could batch calls together, and so on. After rolling out the changes, we use Jaeger and Prometheus to verify their impact.
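As a rough illustration of the "too many calls" check described above, here is a minimal sketch that groups the spans of one trace by operation and flags operations that are called repeatedly. It does not use the Jaeger API: the spans are plain (operation, duration) pairs, and the operation names, durations and the `min_calls` threshold are all hypothetical.

```python
from collections import defaultdict

# Hypothetical spans from a single trace: (operation name, duration in ms).
# In practice these would be exported from a tracing backend; the names
# and numbers here are made up for illustration.
SPANS = [
    ("handler.Query", 200.0),
    ("index.Lookup", 30.0),
    ("index.Lookup", 30.0),
    ("index.Lookup", 30.0),
    ("index.Lookup", 30.0),
    ("cache.Get", 5.0),
    ("cache.Get", 5.0),
]


def summarize_spans(spans):
    """Return (operation, call count, total ms) rows, slowest total first."""
    stats = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})
    for operation, duration_ms in spans:
        stats[operation]["calls"] += 1
        stats[operation]["total_ms"] += duration_ms
    rows = [(op, s["calls"], s["total_ms"]) for op, s in stats.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def batching_candidates(spans, min_calls):
    """Operations called at least min_calls times within one trace."""
    return [op for op, calls, _ in summarize_spans(spans) if calls >= min_calls]
```

Operations that show up many times within a single request are the ones worth checking for batching; whether batching actually helps still depends on the service behind the call.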

We were also seeing very slow queries once in a while that weren't obvious from the dashboards. We then started logging the traceID in our request logs. We picked the requests that took too long and used their traceIDs to probe for issues with those particular requests. This required that we trace every single query fired, which led to scaling issues with Jaeger. We were able to resolve these by using the new gRPC changes in Jaeger coupled with Envoy for load balancing. After deriving all this value from Jaeger, we started onboarding other teams at Grafana onto Jaeger, and we will share the challenges we faced and how we fixed them.
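The log-to-trace workflow above can be sketched roughly as follows. This is only an illustration under stated assumptions: the logfmt-style line layout, the `duration_ms` and `traceID` field names, the sample values and the threshold are all hypothetical, not the actual Cortex request log format.

```python
import re

# Hypothetical request log lines in a logfmt-like key=value layout.
# The field names (duration_ms, traceID) are assumptions for this sketch.
LOG_LINES = [
    "level=info method=GET path=/query duration_ms=120 traceID=aaa111",
    "level=info method=GET path=/query duration_ms=4300 traceID=bbb222",
    "level=info method=GET path=/query duration_ms=1500 traceID=ccc333",
    "level=info method=GET path=/health duration_ms=2",
]

FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Parse key=value pairs from one log line into a dict."""
    return dict(FIELD_RE.findall(line))


def slow_trace_ids(lines, threshold_ms):
    """Return trace IDs of requests at or above threshold_ms, slowest first."""
    slow = []
    for line in lines:
        fields = parse_line(line)
        # Skip lines that carry no trace ID or no duration.
        if "traceID" not in fields or "duration_ms" not in fields:
            continue
        duration_ms = float(fields["duration_ms"])
        if duration_ms >= threshold_ms:
            slow.append((duration_ms, fields["traceID"]))
    slow.sort(reverse=True)
    return [trace_id for _, trace_id in slow]
```

Each returned ID can then be looked up in the tracing UI to inspect that specific request. Note that this only works if every request is traced, which is the requirement that caused the scaling issues mentioned above.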

Speaker bio

Goutham is a developer from India who started his journey as an infra intern at a large company where he worked on deploying Prometheus. After that initial encounter, he started contributing to Prometheus and interned with CoreOS, working on Prometheus’ new storage engine. He is now a maintainer for TSDB, the engine behind Prometheus 2.0. He now works at Grafana Labs on open-source observability tools. When not hacking away, he is on his bike adding miles and hurting his bum.

Comments

  •   Zainab Bawa (@zainabbawa) Reviewer 8 months ago

    Thanks for the proposal, Goutham. There are a couple of comments on the proposal:

    1. We have received another proposal from Bhavin Gandhi on distributed tracing in FaaS. https://hasgeek.com/rootconf/2019-pune/proposals/implementing-distributed-tracing-in-faas-NToMdV5grDFtACY5fvGvtm
    2. We have covered the introduction to D-trace earlier, in Rootconf 2016: https://hasgeek.tv/rootconf/2018-day-2/1509-distributed-tracing-with-jaeger-at-scale Your proposal appears to be an introduction to D-trace, which, because it has already been covered in previous editions of Rootconf, will not be considered further.
    3. Given that the case study has been documented in the blog post, what is the value addition of the talk (for the community) beyond the documentation?

    Look forward to your responses.

  •   Goutham Veeramachaneni (@gouthamve) Proposer 8 months ago (edited 8 months ago)
    1. Interesting proposal there. For those interested in implementing distributed tracing in complex systems and the challenges they can face, that FaaS proposal would be super interesting. My proposal is more high-level: it covers a particular use case for tracing and how you can complement tracing with other signals to get more benefits. The two proposals are more or less orthogonal, with little overlap beyond the intro.
    2. Hrm, there might seem to be a lot of overlap, but I would consider this an extension of the 2016 talk. I can modify it to cover the scaling challenges and the practical implementation details required to correlate with other signals (matching the labels of traces and those of metrics, to make sure you can drill down into the right traces from the right dashboard, etc.). I can also show the kind of slowness we've optimised away (batching of requests, retries and failing quickly, etc.)
    3. Well, the case study is a super high-level one whose gist is: they faced a problem, they deployed Jaeger and they fixed the problem. I intend this talk to be more about how we fixed those problems.

    I can see the point about the 2016 talk, and I honestly expect this to be mainly useful for people who have already attended that talk. This will be less of an "intro, this is Jaeger" talk and more of a "so you have Jaeger, now mix it with metrics and logs to get more out of it" talk.
