Rootconf Delhi edition

On network engineering, infrastructure automation and DevOps

The Urban Myth Of Full Uptime

Submitted by Mohammad Gufran (@notgufran) on Tuesday, 19 November 2019

Section: Full talk (40 mins) Category: SRE Status: Rejected

Abstract

Strategies to achieve high uptime at scale. The points this talk is going to cover are:

  1. A real-life case study
  2. Cloud Architecture
  3. Immutable infrastructure
  4. Infrastructure as code
  5. Secrets Management
  6. Service Discovery
  7. Container management and scheduling
  8. Blue Green Deployment
  9. Observability

Outline

  • About Me, My Company and My Situation
    • Set context for the rest of the headlines
    • Touch on the legacy setup and infrastructure so that people can contrast the upcoming points with it
  • Causes of our downtime
    • Architecture
    • Poor Provisioning Practices
      • Hardware
      • Configuration
    • Lack of Monitoring
    • Missing Backups, DR and BC
    • Poor Technical Choices
      • Storing Data on single node
      • Scaling storage with LVM
      • Node local cache for distributed apps
      • Cyclic API calls
    • Security
      • Checked in secrets
      • Publicly accessible resources
      • Outdated and vulnerable versions of tools
    • Lack of Documentation and Testing
    • Takeaway - Typical problems faced in a poorly architected infrastructure
  • Architecture
    • What’s wrong with it
    • Designing immutable infrastructure
  • Poor Provisioning Practices
    • What’s wrong with it
    • Provisioning immutable resources with Terraform
    • Deploying and Configuring services in immutable fashion
  • Monitoring
    • What’s wrong with it
    • Implementing Observability
  • Backups, DR and BC
    • What’s wrong with it
    • Automated backups with redundant copies
  • Poor Technical Choices
    • What’s wrong with it
    • Fixing the mistakes made so far
  • Lack of Documentation and Testing
  • Summary

Speaker bio

Comments

  •   Anwesha Sarkar (@anweshaalt) Reviewer 4 months ago

    Hello,

    Thank you for the submission. Here is the feedback for your talk:

    1. Is Hashistack an open source project?
    2. Include some pictorial representation.
    3. Add take away points in your slide.
    4. Add conclusion points in your slide.
    5. What are the key carry off points for the audience and why?

    Submit your revised slides by 25th November 2019 (latest). If you have any questions, feel free to ping me.

    Regards,
    Anwesha

    •   Mohammad Gufran (@notgufran) Proposer 4 months ago

      Is Hashistack an open source project?

      Yes. Hashistack is a collection of tools from HashiCorp that includes Consul, Nomad, Vault, Terraform and Packer. All these tools are open source.

      Include some pictorial representation.

      The included slides are not the actual presentation; they are only a broad outline of the topics this talk is going to cover.
      The final presentation will have more material in it.

      Add take away points in your slide.

      The last slide lists the takeaway points and conclusions, albeit not in great depth.

  •   Zainab Bawa (@zainabbawa) Reviewer 4 months ago

    Hello Gufran, thanks for sharing the slides.

    The slides and the abstract say two different things. What are we missing?

    The slides also have a hiring plug. This is not acceptable at Rootconf. Rootconf is a stage for sharing insights from practice, not for pitching a company’s products or hiring. Please remove this from the slides.

    Also, please respond to the following:

    1. What is the focus of this talk?
    2. The slides have a lot of terms, and it seems like there is too much context being covered. We only want to see the context regarding the talk. Which brings us back to the question of: what is this talk about? What is the one point you are trying to make?
    3. Why are you advocating Hashistack? What problems does Hashistack solve which other, similar tools don’t? Why did you pick Hashistack over other available options? How does Hashistack help you in solving the problem you are trying to solve (which is unclear, fundamentally)?

    Look forward to your responses.

  •   Mohammad Gufran (@notgufran) Proposer 4 months ago

    I think the slides were causing a lot of confusion here so I’ve removed the slides for now and changed the abstract to better reflect the content. The focus of this talk is to share the strategies for achieving high uptime at scale.

    We aren’t advocating HashiStack; it is just the combination of tools we are using. We chose HashiCorp tools because of their simplicity, ease of maintenance, and focus on solving a single problem. You can use any other comparable tool.

    •   Zainab Bawa (@zainabbawa) Reviewer 4 months ago

      Thanks Gufran. The abstract and outline contain far too many topics to be covered with reasonable depth in a talk. If you have to prioritise, which topics will you keep and which ones will you filter out?

      More importantly, what is the takeaway for the audience from the talk? This is still unclear because:

      1. It is not clear what problem you are trying to address in your talk and why the problem is generally pervasive?
      2. The title of the talk is vague and unclear.
      3. It is not clear what you mean by “Takeaway - Give context to audience” – give context to what? Why is ‘the context’ you plan to explain to the audience important and interesting in the first place?

      Look forward to your response.

  •   Mohammad Gufran (@notgufran) Proposer 4 months ago

    The abstract and outline contain far too many topics to be covered with reasonable depth in a talk. If you have to prioritise, which topics will you keep and which ones will you filter out?

    This is a proposal for a full 40-minute talk. Those topics aren’t too many for a full talk. I won’t filter out anything because all of it is required either to build context or to discuss the solution.

    1. It is not clear what problem you are trying to address in your talk and why the problem is generally pervasive?

    I am addressing “Strategies to achieve high uptime at scale”; I’ve also added that to the abstract. The problem is downtime, our solution is what I intend to talk about, and the progression of the talk is laid out in the outline section. The result is high uptime.

    I think downtime is a fairly pervasive problem in companies of all kinds and sizes, and everyone can benefit from our learnings.

    2. The title of the talk is vague and unclear.

    ‘Vague and unclear’ are subjective terms. Is there any guideline to choose the title of the talk? If yes, then can you please point me to it? And if no, then do I get to propose a title for my talk?

    3. It is not clear what you mean by “Takeaway - Give context to audience” – give context to what? Why is ‘the context’ you plan to explain to the audience important and interesting in the first place?

    I’m sorry about the confusion here. The sentence lost its essence because of brevity. I’ve updated it to be a bit more meaningful.

    On a side note, it appears to me that you are looking for more ready-to-be-delivered material. If that is the case then I don’t have any video or slides ready yet. What I have though is the storyline and the experience that I’m actively boiling down to form the core of the talk.

  •   Zainab Bawa (@zainabbawa) Reviewer 4 months ago

    Thanks for the responses, Gufran. The concerns with your proposal are the following:

    1. It is unclear why your organization’s solution should be interesting for participants to listen to. In our evaluation, we look at how clearly the problem statement has been articulated, and approach/approaches the proposer has considered to solve the problem, and how the learnings can be extended to the problem per se rather than explaining what the organization’s solution is.
    2. While we don’t always look for ready-to-be-delivered talk material, we want to see progress in every iteration of responses and improvement in articulating the problem.
      At this stage, as you mentioned, the idea of the talk is evolving, and may take longer than our timelines for Rootconf Delhi edition. Given this situation, we can’t consider your proposal for a talk at Rootconf Delhi, but will be happy to do so for future editions of Rootconf. If you plan on attending the conference, you can consider doing a flash talk on the problem of uptime.

  •   Mohammad Gufran (@notgufran) Proposer 4 months ago

    1. It is unclear why your organization’s solution should be interesting for participants to listen to. In our evaluation, we look at how clearly the problem statement has been articulated, and approach/approaches the proposer has considered to solve the problem, and how the learnings can be extended to the problem per se rather than explaining what the organization’s solution is.

    I’m at a loss for words now. I’m failing to see the demands you make. Please allow me to unpack the quoted paragraph.

    It is unclear why your organization’s solution should be interesting for participants to listen to

    Because tech organizations take on a new potential cause of outage with every tool they add to their bag.
    Startups struggle to deliver reliable and consistent uptime because their priority is to keep the business alive. Decent-sized businesses buy uptime by throwing money at it, and giants like Google and AWS just deal with it because problems at that scale are almost always seen for the first time.

    We all know about the S3 outage, the recent AWS Frankfurt region degradation, the EBS data loss when the North Virginia datacenters lost power supply, and the BGP route leak that caused the largest known network partition by literally slicing the world into two. Google had to wrestle with similar network congestion to restore YouTube, Gmail, Nest and a bunch of other services. And all that in just this year. Those are the companies that can hire the best minds; now think about the companies that gather most of their talent from college placements or hire people with a couple of years of experience.

    Take Indian startups as an example. I don’t want to mention any names here, but these are the apps that you can find on everyone’s phone. A lot of people use their services, and everyone has experienced that moment of anxiety when the app suddenly stops behaving the way you expect it to and leaves you befuddled. The ordeal might last for just a couple of minutes, but in those couple of minutes a lot of things happen behind the scenes, including panic among the engineers who are responsible for the reliable functioning of said app. Multiply that by several similar incidents spread across the week or month, and it becomes the pain that causes customer churn, degrades the credibility of the company and generally makes life miserable for the business, the engineers and the customers alike.

    Business outage is not a phenomenon unique to Shuttl; everybody is familiar with it, to the point that the number of 9s in your uptime is used as a status symbol. That is why people are always looking for ways to reduce outages, and that is also why it is interesting for participants to listen to.

    Along with that, people who attend Rootconf are generally curious about the technology. In fact that is primarily the reason they attend the conference. If these people are interested in hearing about “Merging two live data-centers”, “PubSub Realtime messaging service”, and a “Service Graph”, they’ll certainly find it useful to learn something they can readily utilize to improve their daily life and workflows.

    In our evaluation, we look at how clearly the problem statement has been articulated, and approach/approaches the proposer has considered to solve the problem

    This is laid out in the abstract section.

    Strategies to achieve high uptime at scale. The points this talk is going to cover are:

    1. A real-life case study
    2. Cloud Architecture
    3. Immutable infrastructure
    4. Infrastructure as code
    5. Secrets Management
    6. Service Discovery
    7. Container management and scheduling
    8. Blue Green Deployment
    9. Observability

    Achieving high uptime at scale is somewhat different from doing so with an app used by a handful of people. It is not just engineers writing code and deploying to production. Scale demands participation from all teams. There are new features on the roadmap, PMs hounding the engineers for a quick MVP, the marketing team running campaigns to engage customers and causing traffic surges, analytical models routinely consuming and producing terabytes of data, tens of deployments to production every day, each with its own risks, ad-hoc exotic requests coming from the business or the auditors, and all that while the infrastructure has to undergo regular maintenance.

    The list then goes on to mention the approach we took to solve the problem.

    2. While we don’t always look for ready-to-be-delivered talk material, we want to see progress in every iteration of responses and improvement in articulating the problem.

    I have not been given any feedback on a lack of progress or a poorly articulated problem statement. I’ve been asked several times “What is the point you are trying to make” in different forms, to which I replied with more information each time. Please consider addressing the particular instances of lack of information or objective vagueness so that I can fill the gap. If ready-to-be-delivered material is what’s required then please state that explicitly.

    At this stage, as you mentioned, the idea of the talk is evolving, and may take longer than our timelines for Rootconf Delhi edition.

    I respectfully refute that. I did not say the idea is evolving nor that it would take longer than the timelines for Rootconf Delhi. What I said is the following (quoted from my reply):

    “On a side note, it appears to me that you are looking for more ready-to-be-delivered material. If that is the case then I don’t have any video or slides ready yet. What I have though is the storyline and the experience that I’m actively boiling down to form the core of the talk.”

    Given this situation, we can’t consider your proposal for a talk at Rootconf Delhi

    If making an assumption and acting on it isn’t a standard part of the review then I’d request you to please go through the proposal once again and reconsider.
