Jun 2019
17 Mon
18 Tue
19 Wed
20 Thu
21 Fri 08:45 AM – 05:40 PM IST
22 Sat 09:00 AM – 05:30 PM IST
23 Sun
Submitted May 31, 2019
The work of a Site Reliability Engineer is often misunderstood or underrepresented in the tech community. Most SRE stories get lost in daily on-call schedules. This talk opens up that gold mine by discussing the issues an SRE team faces and the solutions built around them. The intent is to give the audience a set of case studies that deal with distributed environments and scale.
The purpose of an SRE team is to ship code in a fast, reliable and economical manner. The four key principles of SRE are measuring risk factors, automation, visibility and simplicity.
Infrastructure is the entry point for deploying code to production. While cloud providers make this task easy, there were deeper risk factors we had to account for, such as versioning, locking against concurrent updates of resources, and enabling webhooks. We developed a tool called Tessellate to do just this.
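To make the three risk factors concrete (versioning, locking against concurrent updates, and webhooks), here is a minimal in-memory sketch. It is not how Tessellate is implemented; the class `ResourceStore`, its methods, and the `VersionConflict` error are hypothetical names chosen for illustration, and a real tool would persist state and deliver webhooks over HTTP.

```python
import threading


class VersionConflict(Exception):
    """Raised when an update is based on a stale version of a resource."""


class ResourceStore:
    """Hypothetical store illustrating versioning, locking and change hooks."""

    def __init__(self):
        self._history = {}   # resource name -> list of configs (index + 1 = version)
        self._locks = {}     # resource name -> lock guarding that resource
        self._hooks = []     # callables standing in for webhooks
        self._guard = threading.Lock()  # protects the _locks dict itself

    def on_change(self, hook):
        """Register a callable invoked as hook(name, new_version)."""
        self._hooks.append(hook)

    def _lock_for(self, name):
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())

    def update(self, name, config, base_version):
        """Apply config only if the caller saw the latest version."""
        with self._lock_for(name):  # one writer per resource at a time
            history = self._history.setdefault(name, [])
            if base_version != len(history):
                raise VersionConflict(
                    f"{name}: expected version {len(history)}, got {base_version}"
                )
            history.append(dict(config))  # keep every version for rollback
            new_version = len(history)
        for hook in self._hooks:  # notify outside the lock
            hook(name, new_version)
        return new_version

    def get(self, name, version):
        """Return the config stored for a given version (1-based)."""
        return self._history[name][version - 1]
```

In this sketch a caller passes the version it last read; a second writer working from the same stale version is rejected instead of silently overwriting the first.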
Choosing the right scheduler for the workload and the types of services helps maintain simplicity across all deployments.
We discuss the risk factors involved in our initial scheduler design, and how we built a service to circumvent those risks.
We discuss network-related administrative tasks to understand why automation is an essential principle of SRE. We walk through two scenarios where we automated processes and workflows by building lightweight services.
We discuss how observability is much more than merely gathering metrics. We understand what visibility means, and see how a good monitoring solution helps us gain the right amount of visibility into production systems.
The core belief of an SRE team is to solve problems for the larger good rather than restricting ourselves to the problem at hand. Every tool we use was built with this intent in mind. Each solution was a step towards better visibility into and access to production systems, and each made our on-call shifts more manageable. The key takeaway from this talk is to follow the same approach: take a step back and think twice before doing something manually, asking whether the problem is repeatable, whether the solution is reusable, and whether it can be automated in a simple manner.
N/A
Talina is a software engineer at Trusting Social.
She has worked on data-intensive projects primarily written in Java, using Spark.
She was recently exposed to the world of Site Reliability Engineering, where she worked on designing and implementing monitoring and alerting systems for a large-scale infrastructure.
https://docs.google.com/presentation/d/1SZb33H2x5Y9lGisQb-lNFgHi3z1P0GaDqLCh9nTGC8k/edit?usp=sharing
Zainab Bawa
@zainabbawa Editor & Promoter
I love the war stories, @talina06. Very well done.
Couple of things to add:
Rest, let's take feedback in the rehearsal this evening. Best of luck.
Zainab Bawa
@zainabbawa Editor & Promoter
Thanks for the submission Talina. Some comments on structuring the talk:
Talina Shrotriya
@talinashro Submitter
Hi Zainab. Thank you for all the pointers, they were helpful. I've tried to accommodate your feedback in my slides (attached in the proposal). Hope you can go through the slides and advise further.
-Talina