May 2018
10 Thu 08:15 AM – 05:25 PM IST
11 Fri 08:30 AM – 06:20 PM IST
Tilak Lodha
Downtime is part of every system and infrastructure. All you can do is reduce its duration or catch it at the right moment with the right alerting.
Key ideas:
The talk will mostly cover these topics: why we need all of them and how they have helped us over time, either by reducing our system downtime or by preventing it entirely. It covers how to decide which metrics are important and what the right alerts for them are. I will also talk about the anomaly detection algorithm we use and how it has saved us from generating false alerts and from missing the right ones.
Anyone can attend the talk, but people with a little knowledge of the TICK stack are the right audience for it.
At Go-Jek we have many services, but one of the core services, whose responsibility is to find a driver for each booking a customer makes, operates at a very different scale: managing 3 million bookings and 1 million drivers every day is a tough job. Still, somehow, we do that task every day. Of course, we have our glitches and downtime, but with time you gain the experience to put the right alerts and monitoring in your service and infrastructure, so that you save yourself from mayhem and from submitting an RCA of the event :P
The service I am talking about reports 1500 metrics to Grafana, and these are only service-level metrics; the system-level metrics, such as network, CPU, memory and others, are far more numerous. Just for the service we have around 200 alerts, any of which can trigger at any point to tell us something is wrong. So how do we do that?
Your service has a starting point. Is it dependent on something? Put a metric there. The dependency can be of any type: RabbitMQ, Postgres, or an HTTP request. Put an alert on everything: a warn-level alert and a critical-level alert, each with the right threshold values. This handles the start of your service; then go ahead and watch the other flows in the same way.
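As a rough illustration of the warn/critical idea, here is a minimal Python sketch. The metric names and threshold values are hypothetical and are not taken from the Go-Jek setup; in a TICK stack this kind of rule would normally live in the alerting layer rather than in application code.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Threshold:
    """Warn and critical limits for one metric (values are hypothetical)."""
    warn: float
    critical: float


def classify(value: float, threshold: Threshold) -> Optional[str]:
    """Return "critical", "warn", or None for a metric where higher is worse."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warn:
        return "warn"
    return None


# Hypothetical dependency metrics measured at the entry point of a service.
THRESHOLDS: Dict[str, Threshold] = {
    "postgres.query_ms": Threshold(warn=200.0, critical=1000.0),
    "rabbitmq.queue_depth": Threshold(warn=5000.0, critical=20000.0),
    "http.upstream_ms": Threshold(warn=300.0, critical=1500.0),
}


def evaluate(samples: Dict[str, float]) -> Dict[str, str]:
    """Map each known metric to its alert level, leaving out healthy ones."""
    alerts = {}
    for name, value in samples.items():
        threshold = THRESHOLDS.get(name)
        if threshold is None:
            continue  # no rule configured for this metric
        level = classify(value, threshold)
        if level is not None:
            alerts[name] = level
    return alerts
```

Calling `evaluate` with one sample per dependency returns only the metrics that crossed a limit, so a slow Postgres query can raise a warning while a much slower upstream HTTP call raises a critical alert.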
Create a backup of everything. If your database is behaving weirdly, make a slave of it and be prepared to use it as the master at any point in time. If your Redis server is not healthy, use twemproxy. In the complete talk I will describe how having a backup plan, more than anything else, kept us on the safer side.
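One small piece of a backup plan can be sketched in code: trying the primary first and falling back to a standby when it fails. The sketch below is an assumption-laden illustration in Python, not how Go-Jek does it; `call_with_fallback` is a hypothetical helper, and it only covers falling back for a single call, not promoting a slave to master.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(backends: Sequence[Callable[[], T]]) -> T:
    """Try each backend in order and return the first successful result.

    If every backend fails, the error from the last one is re-raised.
    """
    last_error: Optional[Exception] = None
    for backend in backends:
        try:
            return backend()
        except Exception as error:  # a real service would catch narrower errors
            last_error = error
    if last_error is None:
        raise ValueError("no backends configured")
    raise last_error
```

The backends are passed as zero-argument callables, for example a read against the primary followed by the same read against the replica, so the helper stays independent of any particular database client.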
I will also discuss our anomaly detection algorithm, which is based on pure statistics: how we came up with it and how we tested it against our use case. As time passes and it has more data, the algorithm keeps refining itself. Believe me, this algorithm is very simple, but it has saved us many times.
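The proposal does not describe the algorithm itself, so the following is only one simple statistical approach that fits the description: a z-score check against a running mean and standard deviation, updated online with Welford's method so the baseline refines as more data arrives. The class names, the `z_limit` of 3.0 and the `min_samples` warm-up are all assumptions made for this sketch.

```python
import math


class RunningStats:
    """Running mean and sample standard deviation (Welford's method)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self) -> float:
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


class ZScoreDetector:
    """Flags values that lie more than z_limit standard deviations from the mean."""

    def __init__(self, z_limit: float = 3.0, min_samples: int = 30) -> None:
        self.z_limit = z_limit
        self.min_samples = min_samples  # warm-up period before any alerting
        self.stats = RunningStats()

    def observe(self, value: float) -> bool:
        """Return True if value looks anomalous against the history so far."""
        stats = self.stats
        anomalous = False
        if stats.count >= self.min_samples and stats.stddev > 0:
            z_score = abs(value - stats.mean) / stats.stddev
            anomalous = z_score > self.z_limit
        if not anomalous:
            # Only normal points refine the baseline, so one spike does not
            # widen the band and hide the next one.
            stats.update(value)
        return anomalous
```

Leaving anomalous points out of the baseline is a design choice of this sketch: it keeps spikes from masking later spikes, at the cost of adapting slowly when the metric legitimately shifts to a new level.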
I will cover how we manage system metrics to configure our infrastructure, including something on iostat, and what the optimum values of disk, CPU and load average should be while running services written in different languages.
Along with this, there will be some stories from Go-Jek: some funny test alerts in the middle of the night, some that made us focus more on our alerting and monitoring, and how we audited our systems.
(Got to know about the conference at the last minute. Couldn’t find time to prepare any slides, and am submitting this proposal in a hurry. Willing to make changes if slides are needed, and open to any suggestions on the content.)
Nothing specific needed from the participants.
Tilak works as a Product Engineer at Go-Jek (Indonesia’s first unicorn). He studied Computer Science at IIT Indore and graduated in 2017. Since then he has worked on different tech stacks and in many languages, including Clojure and Golang. His hobbies include reading books and listening to music, and he recently started blogging.
https://docs.google.com/presentation/d/1zedMBVArBgM-eBxfLx3Bax8qGyaCdDlMDC0nXfmKSjM/edit?usp=sharing