Rootconf 2018

On scaling infrastructure and operations

12 People managing a Million Drivers just with right Alerting and good Monitoring

Submitted by Tilak Lodha (@tilaklodha) on Saturday, 10 March 2018


Technical level

Beginner

Section

Crisp Talk

Status

Submitted


Total votes:  +6

Abstract

Downtime is part of every system and infrastructure; all you can do is reduce its duration or avert it at the right moment with the right alerting.

Key ideas:
- Identify the critical flows of the service. Stop everything else and audit those flows: find their dependencies and add metrics and alerts accordingly. This will save you a lot of time on production issues.
- Try to find the problem within the first 15 minutes of a downtime; otherwise it will put back pressure on other services and you won't know where the problem is.
- Your monitoring dashboards should mirror the flow of the service itself, so that in a crisis you know everything was working fine up to this point, and that part is the culprit.
- We developed our own anomaly detection algorithm, tailored to our use case; I will be talking about this as well.

The talk will mostly cover these topics: why we need all of this and how it has helped us over time, either in reducing our system downtime or preventing it entirely; how to decide which metrics are important and what the right alerts are. I will also talk about the anomaly detection algorithm we use and how it has saved us from generating false alerts and from missing the right ones.

Anyone can attend the talk, but people with a little knowledge of the TICK stack are the ideal audience for it.

Outline

At Go-Jek, we have many services, but one of the core services, whose responsibility is to find a driver for each booking a customer makes, operates at a very different scale: managing 3 million bookings and 1 million drivers every day is a tough job. Still, somehow, we do that task every day. Of course, we have our glitches and downtime, but with time you gain the experience to put the right alerts and monitoring in your service and infrastructure, to make sure you save yourself from mayhem and from submitting an RCA of the event :P

The service I am talking about reports 1,500 metrics to Grafana, and these are only the service-level metrics; the system-level metrics (network, CPU, memory, and others) are far more numerous. And just for this service, we have around 200 alerts that can trigger at any point to say something is wrong. So how do we do it?

  • There will be a start point of your service. Is it dependent on something? Put a metric there. The dependency can be of any type: RabbitMQ, Postgres, or an HTTP request. Put an alert on everything: a warn-level alert and a critical-level alert, with the right threshold values. This handles the start of your service; similarly, go ahead and watch the other flows.
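The two-level alerting described above can be sketched as a simple check. This is a minimal, hypothetical illustration (the metric name and thresholds are invented, not Go-Jek's actual configuration); in practice the same idea would live in something like Kapacitor's warn/crit levels:

```python
def check_metric(name, value, warn_at, crit_at):
    """Classify a metric sample against warn- and critical-level
    thresholds, highest severity first."""
    if value >= crit_at:
        return ("critical", f"{name}={value} breached critical threshold {crit_at}")
    if value >= warn_at:
        return ("warn", f"{name}={value} breached warn threshold {warn_at}")
    return ("ok", None)

# Hypothetical dependency metric: depth of a RabbitMQ queue.
level, message = check_metric("rabbitmq_queue_depth", 1200, warn_at=500, crit_at=1000)
```

Here `level` comes back as `"critical"` because 1200 exceeds both thresholds; picking the right `warn_at`/`crit_at` values per dependency is exactly the audit work the bullet describes.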

  • Create a backup of everything. If your database is behaving weirdly, make a slave of it and be prepared to promote it to master at any point. If your Redis server is not healthy, use twemproxy. In the complete talk, I will explain how having a backup plan for everything kept us on the marginally safer side.

  • I will also discuss our anomaly detection algorithm, which is based on pure statistics: how we came up with it and tested it against our use case. As time goes on and it sees more data, the algorithm keeps refining itself. Believe me, this algorithm is very simple, but it has saved us many times.
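The proposal does not spell out the algorithm, so as a rough sketch of what a "pure statistics" detector that refines itself with more data could look like, here is a classic z-score check over a metric's history (an assumption for illustration, not necessarily the actual Go-Jek algorithm):

```python
import statistics

def is_anomalous(history, value, z_threshold=3.0, min_samples=30):
    """Flag `value` as anomalous if it lies more than `z_threshold`
    standard deviations from the mean of `history`. With too few
    samples we stay silent, which avoids false alerts early on; as
    history grows, the mean/stdev estimates (and the alert) refine."""
    if len(history) < min_samples:
        return False  # not enough data to judge yet
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean  # flat history: any deviation is suspect
    return abs(value - mean) / stdev > z_threshold

# Hypothetical usage: latency samples hovering around 100-104 ms.
samples = [100.0 + (i % 5) for i in range(60)]
spike = is_anomalous(samples, 500.0)   # large deviation -> flagged
normal = is_anomalous(samples, 102.0)  # near the mean -> not flagged
```

The appeal of this family of detectors is the point the bullet makes: it is very simple, yet it suppresses false alerts (static thresholds fire on normal variance) while still catching the real outliers.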

  • How we manage the system metrics to configure our infrastructure, including, from iostat, what the optimum values of disk, CPU, and load average should be when running services written in different languages.

Along with all this, some stories from Go-Jek: funny test alerts in the middle of the night, some that made us focus more on our alerting and monitoring, and how we audited our systems.

(Got to know about the conference at the last minute. I couldn't find time to prepare slides and am submitting this proposal in a hurry. Willing to make changes if slides are needed, or based on any suggestions about the content.)

Requirements

Nothing specific needed from the participants.

Speaker bio

Tilak works as a Product Engineer at Go-Jek (Indonesia's first unicorn). He studied Computer Science at IIT Indore and graduated in 2017. Since then, he has worked on different tech stacks and many languages, including Clojure and Golang. His hobbies include reading books and listening to music, and he recently started blogging.

Slides

https://docs.google.com/presentation/d/1zedMBVArBgM-eBxfLx3Bax8qGyaCdDlMDC0nXfmKSjM/edit?usp=sharing

Comments

  • 2
    Ramanan Balakrishnan (@ramananbalakrishnan) 8 months ago

    Seems like an interesting topic. I think you can still edit your proposal, so if you have slides, do update the link.

    Regarding content, apart from proactively identifying metrics to monitor, do you also have standardized runbooks for reacting to outages? During an outage, are rollbacks to the last working version automated while an audit is being carried out?

    I am quite sure you might have a lot more content for a full 40-min talk rather than a short one (hot standbys, blue-green deployments, …)

    • 1
      Tilak Lodha (@tilaklodha) Proposer 8 months ago

      Hey, thanks for the comment. I know the proposal could have been better, but for now I am working out the best context for the talk. I am preparing the slides and will upload them once I finish everything. I want it to be a small, high-quality talk on monitoring and alerting, while improving the proposal at the same time.

  • 1
    Tilak Lodha (@tilaklodha) Proposer 8 months ago

    Thanks for showing interest in the proposal. Here are the draft slides: https://docs.google.com/presentation/d/1zedMBVArBgM-eBxfLx3Bax8qGyaCdDlMDC0nXfmKSjM/edit?usp=sharing
