Rootconf 2018

On scaling infrastructure and operations

12 People managing a Million Drivers just with right Alerting and good Monitoring

Submitted by Tilak Lodha (@tilaklodha) on Saturday, 10 March 2018


Technical level

Beginner

Section

Crisp Talk

Status

Submitted


Total votes:  +6

Abstract

Downtime is part of every system and infrastructure; all you can do is reduce its duration or avert it at the right moment with the right alerting.

Key ideas:
- Identify the critical flows of the service. Stop everything else and audit those flows: find their dependencies and add metrics and alerts accordingly. This will save you a lot of time on production issues.
- Try to find the problem within the first 15 minutes of a downtime; otherwise it will put back pressure on other services and you won't know where the problem is.
- Your monitoring dashboards should mirror the flow of the service itself, so that in a crisis you know everything was working fine up to this point, and that part is the culprit.
- We developed our own anomaly detection algorithm, tailored to our use case; I will be talking about this as well.

The talk will mostly cover these topics: why we need all of this and how it has helped us over time, either in reducing our system downtime or preventing it entirely; how to decide which metrics are important and what the right alerts are. I will also talk about the anomaly detection algorithm we use and how it has saved us from generating false alerts and from missing the right ones.

Anyone can attend the talk, but people with a little knowledge of the TICK stack are the ideal audience for it.

Outline

At Go-Jek, we have many services, but one of the core services, whose responsibility is to find a driver for each booking a customer makes, operates at a very different scale: managing 3 million bookings and 1 million drivers every day is a tough job. Still, somehow, we do that task every day. Of course, we have our glitches and downtime, but with time you gain the experience to put the right alerts and monitoring in your service and infrastructure, to make sure you save yourself from mayhem and from submitting an RCA of the event :P

The service I am talking about reports 1,500 metrics to Grafana, and these are only the service-level metrics; the system-level metrics (network, CPU, memory, and others) are far more numerous. And just for this service, we have around 200 alerts that can trigger at any point to say something is wrong. So how do we do it?

  • There will be a start point of your service. Is it dependent on something? Put a metric there. The dependency can be of any type: RabbitMQ, Postgres, or an HTTP request. Put an alert on everything: a warn-level alert and a critical-level alert, with the right threshold values. This handles the start of your service; similarly, go ahead and watch the other flows.
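The two-level alerting described above can be sketched as a simple check. This is a minimal, hypothetical illustration (the metric name and thresholds are invented, not Go-Jek's actual configuration); in practice the same idea would live in something like Kapacitor's warn/crit levels:

```python
def check_metric(name, value, warn_at, crit_at):
    """Classify a metric sample against warn- and critical-level
    thresholds, highest severity first."""
    if value >= crit_at:
        return ("critical", f"{name}={value} breached critical threshold {crit_at}")
    if value >= warn_at:
        return ("warn", f"{name}={value} breached warn threshold {warn_at}")
    return ("ok", None)

# Hypothetical dependency metric: depth of a RabbitMQ queue.
level, message = check_metric("rabbitmq_queue_depth", 1200, warn_at=500, crit_at=1000)
```

Here `level` comes back as `"critical"` because 1200 exceeds both thresholds; picking the right `warn_at`/`crit_at` values per dependency is exactly the audit work the bullet describes.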

  • Create a backup of everything. If your database is behaving weirdly, make a slave of it and be prepared to promote it to master at any point. If your Redis server is not healthy, use twemproxy. In the complete talk, I will explain how having a backup plan for everything kept us on the marginally safer side.

  • I will also discuss our anomaly detection algorithm, which is based on pure statistics: how we came up with it and tested it against our use case. As time goes on and it sees more data, the algorithm keeps refining itself. Believe me, this algorithm is very simple, but it has saved us many times.
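The proposal does not spell out the algorithm, so as a rough sketch of what a "pure statistics" detector that refines itself with more data could look like, here is a classic z-score check over a metric's history (an assumption for illustration, not necessarily the actual Go-Jek algorithm):

```python
import statistics

def is_anomalous(history, value, z_threshold=3.0, min_samples=30):
    """Flag `value` as anomalous if it lies more than `z_threshold`
    standard deviations from the mean of `history`. With too few
    samples we stay silent, which avoids false alerts early on; as
    history grows, the mean/stdev estimates (and the alert) refine."""
    if len(history) < min_samples:
        return False  # not enough data to judge yet
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean  # flat history: any deviation is suspect
    return abs(value - mean) / stdev > z_threshold

# Hypothetical usage: latency samples hovering around 100-104 ms.
samples = [100.0 + (i % 5) for i in range(60)]
spike = is_anomalous(samples, 500.0)   # large deviation -> flagged
normal = is_anomalous(samples, 102.0)  # near the mean -> not flagged
```

The appeal of this family of detectors is the point the bullet makes: it is very simple, yet it suppresses false alerts (static thresholds fire on normal variance) while still catching the real outliers.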

  • How we manage the system metrics to configure our infrastructure, including, from iostat, what the optimum values of disk, CPU, and load average should be when running services written in different languages.

Along with all this, some stories from Go-Jek: funny test alerts in the middle of the night, some that made us focus more on our alerting and monitoring, and how we audited our systems.

(Got to know about the conference at the last minute. I couldn't find time to prepare slides and am submitting this proposal in a hurry. Willing to make changes if slides are needed, or based on any suggestions about the content.)

Requirements

Nothing specific needed from the participants.

Speaker bio

Tilak works as a Product Engineer at Go-Jek (Indonesia's first unicorn). He studied Computer Science at IIT Indore and graduated in 2017. Since then, he has worked on different tech stacks and many languages, including Clojure and Golang. His hobbies include reading books and listening to music, and he recently started blogging.

Slides

https://docs.google.com/presentation/d/1zedMBVArBgM-eBxfLx3Bax8qGyaCdDlMDC0nXfmKSjM/edit?usp=sharing

Comments

  • 2
    Ramanan Balakrishnan (@ramananbalakrishnan) 8 months ago

    Seems like an interesting topic. I think you can still edit your proposal, so if you have slides, do update the link.

    Regarding content, apart from proactively identifying metrics to monitor, do you also have standardized runbooks for reacting to outages? During an outage, are rollbacks to the last working version automated while an audit is being carried out?

    I am quite sure you might have a lot more content for a full 40-min talk rather than a short one (hot standbys, blue-green deployments, …)

    • 1
      Tilak Lodha (@tilaklodha) Proposer 8 months ago

      Hey, thanks for the comment. I know the proposal could have been better, but for now I am working out the best context for the talk. I am preparing the slides and will upload them once I finish everything. I want it to be a small, high-quality talk on monitoring and alerting, while improving the proposal at the same time.

  • 1
    Tilak Lodha (@tilaklodha) Proposer 8 months ago

    Thanks for showing interest in the proposal. Here are the draft slides: https://docs.google.com/presentation/d/1zedMBVArBgM-eBxfLx3Bax8qGyaCdDlMDC0nXfmKSjM/edit?usp=sharing
