10x faster query performance with Jaeger, Prometheus and Correlation!

21 Sep 2019 (Sat), 08:55 AM – 06:20 PM IST


Accepting submissions till 21 Aug 2019, 10:30 AM

St. Laurn Hotel, Pune


##About Rootconf Pune:

Rootconf Pune is a conference for:

- DevOps engineers
- Site Reliability Engineers (SRE)
- Security and DevSecOps professionals
- Software engineers
- Network engineers

The Pune edition will cover talks on:

- InfoSec and application security for DevOps programmers
- DNS and TLS 1.3
- SRE and distributed systems
- Containers and scaling

Speakers from Flipkart, Hotstar, Red Hat, Trusting Social, Appsecco and InfraCloud Technologies, among others, will share case studies from their experience of building security, SRE and DevOps practices in their organizations.

##Workshops:

Two workshops will be held before and after Rootconf Pune:

- Full-day Prometheus training workshop on 20 September, conducted by Goutham V, contributor to Prometheus and developer at Grafana Labs. Details about the workshop are available here: https://hasgeek.com/rootconf/2019-prometheus-training-pune/
- Full-day DNS deep dive workshop on 22 September by Ashwin Murali: https://hasgeek.com/rootconf/2019-dns-deep-dive-workshop-pune/

##Event venue:
Rootconf Pune will be held on 21 September at St. Laurn Hotel, Koregaon Park, Pune-411001.

##Sponsors:

Email sales@hasgeek.com for bulk ticket purchases and to sponsor the Rootconf series.


##To know more about Rootconf, check out the following resources:

- hasgeek.com/rootconf
- hasgeek.com/rootconf/2019
- https://hasgeek.tv/rootconf/2019

For information about the event, tickets (bulk discounts automatically apply on 5+ and 10+ tickets) and speaking, call Rootconf on 7676332020 or write to info@hasgeek.com

Hosted by

Rootconf

Rootconf is a community-funded platform for activities and discussions on the following topics: Site Reliability Engineering (SRE); infrastructure costs, including cloud costs, and their optimization; security, including cloud security.


10x faster query performance with Jaeger, Prometheus and Correlation!

Submitted Jul 15, 2019

Section: Full talk (40 mins) Category: Distributed systems

We hack on Cortex, an open-source CNCF project for distributed Prometheus, and run it in production. As we scaled it up, we noticed poor query performance. We found ourselves adding new metrics on each rollout to test our theories, often shooting in the dark, only to have our assumptions invalidated after a lot of experimentation. We then turned to Jaeger, and things instantly improved.

In this talk, we will introduce you to distributed tracing and show how we implemented tracing with Jaeger, correlated the trace data with Prometheus metrics and logs to understand which subsystems, services and function calls were slow, and optimised those to achieve 10x latency improvements. We will also talk about the issues we faced scaling Jaeger in a multi-cluster scenario, and how we used Envoy to solve the problem of dropped spans. We will share the jaeger-mixin we developed to monitor Jaeger, and talk about how we are evangelising Jaeger internally and onboarding other teams.

Outline

This talk will take the audience through the value that Jaeger and distributed tracing can add, the scaling problems they could hit, and how to evangelise Jaeger inside their company.

This talk expands on https://bit.ly/2YNMWRJ (the official Jaeger blog post about how Grafana Labs uses Jaeger). We will walk through the latency issues we were facing in Cortex and how we used Jaeger to solve them. Along the way, we will show how Jaeger interplayed with our Prometheus and logging setups and how it fits right into the workflow. With Prometheus, we have RED dashboards that highlight which services are slow. We then use Jaeger to drill down and investigate how long each function takes, whether we are making too many calls, whether we could batch calls together, and so on. After rolling out the changes, we use Jaeger and Prometheus to verify their impact.

We were also seeing very slow queries once in a while that weren’t obvious on the dashboards. We started logging the traceID in our request logs, picked the requests that took too long, and used their traceIDs to probe for issues with those particular requests. This required tracing every single query fired, which led to scaling issues with Jaeger. We were able to resolve these by using the new gRPC changes in Jaeger, coupled with Envoy for load balancing. After deriving all this value from Jaeger, we started onboarding other teams at Grafana Labs onto Jaeger, and we will share the challenges we faced and how we fixed them.

Speaker bio

Goutham is a developer from India who started his journey as an infra intern at a large company, where he worked on deploying Prometheus. After that initial encounter, he started contributing to Prometheus and interned with CoreOS, working on Prometheus’ new storage engine. He is now a maintainer of TSDB, the engine behind Prometheus 2.0, and works at Grafana Labs on open-source observability tools. When not hacking away, he is on his bike adding miles and hurting his bum.
