Rootconf 2019

On infrastructure security, DevOps and distributed systems.

About Rootconf 2019:

The seventh edition of Rootconf is a two-track conference with:

  1. Security talks and tutorials in auditoriums 1 and 2 on 21 June.
  2. Talks on DevOps, distributed systems and SRE in auditoriums 1 and 2 on 22 June.

Topics and schedule:

View full schedule here: https://hasgeek.com/rootconf/2019/schedule

Rootconf 2019 includes talks and Birds of a Feather (BOF) sessions on:

  1. OSINT and its applications
  2. Key management, encryption and its costs
  3. Running a bug bounty programme in your organization
  4. PolarDB, Alibaba Cloud's cloud-native database architecture
  5. Vitess
  6. SRE and running distributed teams
  7. Routing security
  8. Log analytics
  9. Enabling SRE via automated feedback loops
  10. TOR for DevOps

Who should attend Rootconf?

  1. DevOps programmers
  2. DevOps leads
  3. Systems engineers
  4. Infrastructure security professionals and experts
  5. DevSecOps teams
  6. Cloud service providers
  7. Companies with heavy cloud usage
  8. Providers of the pieces on which an organization’s IT infrastructure runs – monitoring, log management, alerting, etc.
  9. Organizations dealing with large network systems where data must be protected
  10. VPs of engineering
  11. Engineering managers looking to optimize infrastructure and teams

For information about Rootconf and bulk ticket purchases, contact info@hasgeek.com or call 7676332020. Only community sponsorships are available.

Rootconf 2019 sponsors:

Platinum Sponsor

CRED

Gold Sponsors

Atlassian, Endurance, Trusting Social

Silver Sponsors

Digital Ocean, GO-JEK, Paytm

Bronze Sponsors

MySQL, Sumo Logic, UpCloud, Platform.sh, nilenso, CloudSEK

Exhibition Sponsor

FreeBSD Foundation

Community Sponsors

Ansible, PlanetScale

Hosted by

Rootconf is a forum for discussions about DevOps, infrastructure management, IT operations, systems engineering, SRE and security (from an infrastructure defence perspective).
Piyush Verma

@meson10

Software/Site Reliability of Distributed Systems

Submitted May 5, 2019

Every product either dies a hero or lives long enough to hit reliability issues.
Whether it’s your code or a service you connect to, there will be a disk that fails, a network that partitions, a CPU that throttles, or memory that fills up.
While you go about fixing this, what is the cost of failure, in both effort and lost business, and how much does each nine of reliability cost?
The talk takes a simple, straightforward sample product and examines each failure point in depth. We take one fault at a time and introduce incremental changes to the architecture, the product, and the supporting structure, such as monitoring and logging, to detect and overcome those failures.

Outline

Consider a sample application:
A phone number to which users send an SMS of the form “Remind <date format> about <y>.” When the reminder is due, a service calls the user back. The user is charged for each SMS and for each reminder they answer.
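
To make the failure discussion concrete, here is a minimal Python sketch of the reminder-parsing core such a service might have; the exact date format, function names and fields are assumptions for illustration, not the actual product.

  # Hypothetical sketch of the reminder service's core parsing logic.
  # The SMS format "Remind <date> about <y>" is taken from the outline;
  # the concrete date format and names below are assumptions.
  import re
  from datetime import datetime

  REMINDER_PATTERN = re.compile(
      r"^Remind (?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}) about (?P<what>.+)$"
  )

  def parse_reminder(sms_text):
      """Parse an incoming SMS into a reminder dict, or raise ValueError."""
      match = REMINDER_PATTERN.match(sms_text.strip())
      if not match:
          raise ValueError("SMS does not match the expected reminder format")
      due_at = datetime.strptime(match.group("date"), "%Y-%m-%d %H:%M")
      return {"due_at": due_at, "what": match.group("what")}

  if __name__ == "__main__":
      print(parse_reminder("Remind 2019-06-22 09:00 about the Rootconf keynote"))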

Where do you think this can all start failing?

Static Failures:

  • Disks
  • Network
  • CPU
  • Memory

Behaviour Failures:

  • Degradation
  • Latency
  • Freshness
  • Correctness
  • DDoS

What are the right tools and strategies to measure and monitor these failure points?
What is the cost of measuring, or of leaving it unmeasured?
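
As one possible way to measure the static failure points above, here is a minimal sketch using the psutil library; the choice of library and the 90% threshold are assumptions, and network checks are left out for brevity.

  # Minimal sketch of measuring the "static" failure points (disk, CPU,
  # memory) with psutil. The alert threshold is illustrative only.
  import psutil

  def check_static_resources():
      checks = {
          "disk_used_pct": psutil.disk_usage("/").percent,
          "cpu_used_pct": psutil.cpu_percent(interval=1),
          "memory_used_pct": psutil.virtual_memory().percent,
      }
      for name, value in checks.items():
          status = "OK" if value < 90 else "ALERT"
          print(f"{name}={value:.1f}% [{status}]")

  if __name__ == "__main__":
      check_static_resources()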

There are queues in the system. How do you monitor synchronous and asynchronous architectures?
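
One common way to answer this for an asynchronous path is to measure how long a message waits between being enqueued and being processed; a minimal sketch follows, with the queue and payload purely illustrative.

  # Sketch of measuring queue lag for an asynchronous path: record the
  # enqueue time with each message and compute the wait on dequeue.
  import queue
  import time

  work_queue = queue.Queue()

  def enqueue(payload):
      work_queue.put((time.monotonic(), payload))

  def process_one():
      enqueued_at, payload = work_queue.get()
      lag_seconds = time.monotonic() - enqueued_at
      print(f"processing {payload!r}, queue lag {lag_seconds:.3f}s")
      # In a real system this lag would be exported as a metric and alerted on.

  if __name__ == "__main__":
      enqueue("send reminder SMS")
      time.sleep(0.2)
      process_one()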

The load has started to increase, but before we discuss scaling strategies, let’s quickly go over the CAP theorem.
How do we decide whether we need sharding, a better CPU, or clustering?

How do we add backups? Should they be asynchronous or synchronous?
What criteria should we consider before picking a strategy?
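
As a sketch of the trade-off: a synchronous backup waits for the second copy to be written before acknowledging the caller, while an asynchronous one acknowledges first and copies later. Both write functions below are hypothetical placeholders, not part of the talk.

  # Hypothetical sketch contrasting synchronous and asynchronous backup writes.
  # write_primary/write_backup stand in for real storage calls.
  import threading

  def write_primary(record):
      print(f"primary stored {record!r}")

  def write_backup(record):
      print(f"backup stored {record!r}")

  def save_synchronously(record):
      # Durable on both copies before the caller gets an acknowledgement:
      # higher latency, but no window where the backup lags behind.
      write_primary(record)
      write_backup(record)
      return "acknowledged"

  def save_asynchronously(record):
      # Acknowledge after the primary write; the backup catches up in the
      # background, so a crash in between can lose the record on the backup.
      write_primary(record)
      threading.Thread(target=write_backup, args=(record,)).start()
      return "acknowledged"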

So far, we have been reactive about failures. How do we move to a proactive model?
And meanwhile, could you trace that request from that particular user for me?
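
One lightweight way to make that request traceable is to attach a correlation ID to every log line the request touches; the sketch below assumes simple structured logging, and a full tracing system would propagate the ID across services.

  # Sketch of request tracing via a correlation ID carried through log lines.
  import logging
  import uuid

  logging.basicConfig(format="%(asctime)s %(message)s", level=logging.INFO)

  def handle_request(user_id, sms_text):
      request_id = uuid.uuid4().hex  # attach this ID to every downstream call
      logging.info("request_id=%s user_id=%s received sms=%r", request_id, user_id, sms_text)
      logging.info("request_id=%s parsed reminder", request_id)
      logging.info("request_id=%s scheduled callback", request_id)

  if __name__ == "__main__":
      handle_request("user-42", "Remind 2019-06-22 09:00 about standup")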

At what stage, and how, do we start injecting reliability into the software development process?

Lastly, while all of this is said to improve and fix things, how do we prove that it does? How do you validate that MySQL replicas come back when the master dies? The only way to know is by simulating failures. How do we set up simulations? A decade ago this was called FMEA; now it’s called Chaos Engineering.
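
A chaos experiment along those lines can be as small as stopping the master and polling until a replica answers. The container names and the health probe below are placeholders for illustration, not a recipe from the talk.

  # Sketch of a chaos experiment: stop the MySQL master and verify failover.
  # "mysql-master" and "mysql-replica" are hypothetical container names.
  import subprocess
  import time

  def kill_master():
      subprocess.run(["docker", "stop", "mysql-master"], check=True)

  def replica_is_alive():
      # Placeholder probe: in practice, issue a test write through the
      # application's normal database endpoint and check that it succeeds.
      result = subprocess.run(
          ["docker", "exec", "mysql-replica", "mysqladmin", "ping"],
          capture_output=True,
      )
      return result.returncode == 0

  def run_experiment(timeout_seconds=60):
      kill_master()
      deadline = time.monotonic() + timeout_seconds
      while time.monotonic() < deadline:
          if replica_is_alive():
              print("failover observed: replica answered after the master was stopped")
              return True
          time.sleep(2)
      print("failover NOT observed within the timeout")
      return False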

And oh, we should also discuss Site vs Software Reliability.

Requirements

  • Don’t bring your laptop.
  • Bring your questions.

Speaker bio

I head Site Reliability Engineering at Trustingsocial.com, where we credit-score nearly half a billion users across 3 countries, 5 datacenters and 3 clouds, and we are on our way to credit-scoring 1 billion people across South-East Asia.

I have been working on infrastructure engineering for almost a decade, from the days when, if things broke, they would make a sound.
I have had the fortune of learning these skills from some top engineers while scaling fairly large, complex database systems like Cassandra, building an IaaS platform, and building our own microservice communication bus.

The talk is a collection of my learnings over the past 15 years, and I hope it helps engineers and architects alike.
