Rootconf Hyderabad edition

On SRE, systems engineering and distributed systems

Dashboards as Code

Submitted by Sanooj Mananghat (@sanoojm) on Monday, 27 May 2019

Section: Crisp talk Technical level: Intermediate Session type: Lecture

Abstract

Understanding the state of your infrastructure and systems is essential for ensuring the reliability and stability of services. The best way to gain this insight is with robust dashboards that visualize data and stable alert rules that fire when things appear to be broken.
Configuring and modifying dashboards and alerts by hand is error-prone. Versioning dashboards and alerts using “Infrastructure as code” is extremely useful in a fast-paced environment.

Outline

How to use infrastructure-as-code tooling to build more robust dashboards without tech debt.

Why dashboards?

Existing methods of maintaining dashboards.

  • At the current scale of dashboard and alert usage, it is difficult to grow further without proper automation.
  • Consistency issues
    • Dashboards can be easily edited by any layman, resulting in consistency issues.
    • Alerts disabled during maintenance/deployment lead to undetected incidents.
  • Automation challenges
    • 90% of the dashboards are still being created/modified by clicking on the UI.
    • Manual configuration can lead to errors.
  • No history.
    • No rollback is possible after an unintended modification.

Solution 1: Git

  • Solves consistency issues and maintains history.
  • It still cannot validate the huge JSON files; manual review is required.
  • Automatically turning the stored JSON files into dashboards is still not solved.
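
As a rough illustration of the validation gap noted above: Git stores the JSON but does not check it, so any check has to be bolted on separately. A minimal sketch in Python, assuming Grafana-style dashboard JSON with top-level `title` and `panels` keys (the function name and the required-key list are hypothetical, not from the talk):

```python
import json

# Hypothetical minimal set of keys a dashboard file must carry; adjust per vendor.
REQUIRED_KEYS = ("title", "panels")


def validate_dashboard(raw):
    """Return a list of problems found in one dashboard JSON document (a string)."""
    try:
        dashboard = json.loads(raw)
    except json.JSONDecodeError as exc:
        return ["invalid JSON: {} (line {})".format(exc.msg, exc.lineno)]
    if not isinstance(dashboard, dict):
        return ["top-level value must be an object"]
    # Report every required key that is absent.
    problems = ["missing key: " + key for key in REQUIRED_KEYS if key not in dashboard]
    # If 'panels' is present it must be a list.
    if not isinstance(dashboard.get("panels", []), list):
        problems.append("'panels' must be a list")
    return problems
```

A check like this catches only structural mistakes. Whether a changed query or threshold is correct still needs a human reviewer, which is the limitation the list above points at.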

Solution 2: Git + Terraform

  • Terraform providers/plugins: automating dashboards using Terraform providers.
  • Manages state and conflicts, and supports validation and rollback, all with zero tech debt.
  • Adds value to your skills.

Demo of how to create a dashboard with Terraform and Grafana

Speaker bio

Sanooj Mananghat works as a DataSRE at Intuit and has 9 years of experience in the DevOps world. He is a blogger and open source enthusiast with contributions to multiple open source projects, including Terraform providers.

Links

Slides

https://docs.google.com/presentation/d/e/2PACX-1vSmx0HtVlyFPO8dfwLDAGTVXq1UzBlu9AK8z0gsWpdBDXroZ-Dk7OQdMZF7BOvMxl_D1bd7duR3GEIz/pub?start=false&loop=false&delayms=3000

Comments

  • Zainab Bawa (@zainabbawa) Reviewer 4 months ago

    Thank you for the proposal, Sanooj. Couple of questions:

    1. The abstract has been written in third person. Will you be sharing a personal case study or have you abstracted the details based on some experiences and will be presenting them as abstracted knowledge?
    2. The above abstract seems to have two-three points it is trying to make:
      - Why build dashboards?
      - How to build dashboards without technical debt.
      - Building dashboards as code with Terraform.
      You have to choose between one of these pitches to make the talk more focussed.

    Rootconf participants are typically interested in listening to a war story or a use case. Personalize this talk to help participants to understand:

    1. What is the problem that you were trying to solve which led you to the current approach and tooling?
    2. Why did you zero in on this approach and on Terraform? What options did you consider for solving this problem? How did you compare the solutions and narrow down on this approach? Explain the decision-making journey.
    3. Why Terraform? Are there similar options that participants can consider? Is there a learning curve/cost/resources issue with Terraform which participants should be aware of?
    4. Explain your solution and architecture in detail.
    5. How did your team adapt to this solution? What has been the impact of this solution on the metrics and challenges that you have outlined?
    6. What was the situation before you developed this solution? Explain both before and after scenarios.

    You can either respond to the comments by:

    1. Editing your proposal to incorporate the above.
    2. OR, create and upload draft slides to detail the decision-making and outcome journey.
    3. OR, respond to comments here.
  • Sanooj Mananghat (@sanoojm) Proposer 4 months ago

    Hi Zainab,
    Thanks for reviewing. Please find my answers inline.

    1. The abstract has been written in third person. Will you be sharing a personal case study or have you abstracted the details based on some experiences and will be presenting them as abstracted knowledge?

      I will be sharing my experience of how we were using dashboards, what bottlenecks we found when we moved to a bigger scale, what other options we tried, and why we ended up using Terraform. Yes, this is my experience.

    2. The above abstract seems to have two-three points it is trying to make:
      - Why build dashboards?

      Just a two line intro of why dashboards.

    • How to build dashboards without technical debt.

      This is an outcome of dashboards as code with Terraform. There are multiple ways of building a dashboard: manually through the UI, automated with a script, etc. The manual method has lots of challenges, such as human error and inability to scale. To avoid that, we can write our own custom scripts, but maintaining a custom script has its own challenges. Moreover, all the script does is either create a JSON form of the dashboard or store a huge JSON-formatted file. These JSON files are so large that reviewing them or spotting mistakes is impractical.

    • Building dashboards as code with Terraform.

      Terraform plugins are available for almost all dashboard vendors, be it Kibana, Grafana, Wavefront, SignalFx, etc. We don’t have to reinvent the wheel. Moreover, Terraform syntax is familiar, easy to learn and worth learning (it’s the 2nd top tech skill in 2018 and 2019).

    You have to choose between one of these pitches to make the talk more focussed.

    Rootconf participants are typically interested in listening to a war story or a use case. Personalize this talk to help participants to understand:

    What is the problem that you were trying to solve which led you to the current approach and tooling?

    Consistency in dashboards. Configuring dashboards and alerts has become so easy that even a non-expert can go and edit them through the UI. This has led to lots of consistency issues.
    - The dashboard that you have configured today might not be in the same shape tomorrow.
    - Alerts disabled during maintenance/deployment lead to undetected incidents. There is no way to track these modifications and revert to the most recent stable state.
    - Initially we started with backups of the JSON files behind the dashboards. But that only helped us recover in case of an issue, with or without the recent changes.
    Automation at scale. As we moved to a bigger scale, backup and restore also became challenging, as did the manual creation of dashboards and alerts through the UI. We started to write our own custom scripts, and finally a Python client that interacts with the dashboard API and allows us to query and create resources like dashboards and alerts.
    - This worked well for simple dashboards or dashboards for which we had created templates.
    - However, for custom dashboards with multiple rows, each with a different number of columns, it was unusable. We had to write new code for each such dashboard, which brought us back to square one: now we had to maintain both the dashboard and its code.
    - Validation: Validating/reviewing PRs with huge JSON files was nearly impossible.
    Rollback: No history of the previous state of the dashboard.
    - With automation using custom scripts under Git, we achieved it to an extent.
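
    To make the template approach described above concrete, here is a minimal sketch, not our actual client. It assumes a Grafana-style JSON model (`panels` with `gridPos` on a 24-unit-wide grid); the function name, the panel type and the panel height are hypothetical choices for illustration.

```python
GRID_WIDTH = 24   # Grafana's dashboard grid is 24 units wide
PANEL_HEIGHT = 8  # arbitrary height chosen for this sketch


def build_templated_dashboard(title, panel_titles, columns=2):
    """Lay panels out on a uniform grid with a fixed number of columns per row."""
    width = GRID_WIDTH // columns
    panels = []
    for index, panel_title in enumerate(panel_titles):
        # Row and column of this panel in the uniform grid.
        row, col = divmod(index, columns)
        panels.append({
            "id": index + 1,
            "title": panel_title,
            "type": "graph",
            "gridPos": {
                "x": col * width,
                "y": row * PANEL_HEIGHT,
                "w": width,
                "h": PANEL_HEIGHT,
            },
        })
    return {"title": title, "panels": panels}
```

    A template like this only covers uniform grids. A dashboard where each row has a different number of columns falls outside it and needs its own layout code, which is the maintenance burden described above.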

    Why did you zero in on this approach and on Terraform? What options did you consider for solving this problem? How did you compare the solutions and narrow down on this approach? Explain the decision-making journey.

    As mentioned above, we tried different solu

    Why Terraform? Are there similar options that participants can consider? Is there a learning curve/cost/resources issue with Terraform which participants should be aware of?

    All dashboard vendors support a REST API. Writing your own modules has its own challenges, as discussed above.
    Grafana has a Python module called grafanalib.
    Validation: Validation is built into Terraform. Its declarative approach helped us with validation.
    No more JSON: Instead of long JSON files, we just need to concentrate on Terraform code.
    The introduction of inner loops, dynamic blocks and complex types in Terraform 0.12 made it even easier to manage dashboards.
    Terraform maintains the state of the dashboard and decides whether to create, update, read or delete a resource based on the Terraform script.
    We were able to easily handle different environments (test, prod) with Terraform workspaces.
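
    The create/update/delete decision mentioned above can be pictured with a small conceptual sketch. This is not Terraform’s implementation, only an illustration of comparing declared resources against recorded state; the function name and the action labels are hypothetical.

```python
def plan(desired, current):
    """Compare desired resources (from code) with current ones (from state).

    Both arguments map a resource name to its configuration dict.
    Returns a dict mapping each resource name to one of
    "create", "update", "delete" or "no-op".
    """
    actions = {}
    for name, config in desired.items():
        if name not in current:
            actions[name] = "create"   # declared in code, absent from state
        elif current[name] != config:
            actions[name] = "update"   # present, but configuration drifted
        else:
            actions[name] = "no-op"    # already matches the declaration
    for name in current:
        if name not in desired:
            actions[name] = "delete"   # in state, no longer declared in code
    return actions
```

    With this model, a dashboard edited by hand in the UI shows up as an "update" on the next run and is brought back to what the code declares.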

    Explain your solution and architecture in detail.

    Will provide this along with the slides which I am preparing.

    How did your team adapt to this solution? What has been the impact of this solution on the metrics and challenges that you have outlined?

    Most of us were already comfortable with Terraform as we use it in our day-to-day work. Those who weren’t were able to adapt easily, since Terraform is really easy to learn and it’s worth learning (it’s the 2nd top tech skill in 2018 and 2019).
    Creating dashboards just became easier, more interesting and a value add.

    What was the situation before you developed this solution? Explain both before and after scenarios.

    All of the above. In addition, with the new automation we were able to add and remove components dynamically. We even have Terraform running in a serverless setup, updating our dashboards on any change.

  • Zainab Bawa (@zainabbawa) Reviewer 4 months ago

    Thanks for the slides, Sanooj. Here are a bunch of important points that came up in the review which you need to think through and articulate on your slides:

    1. The talk has to be centred around the merits of dashboards as code. Terraform is how you accomplished this. For example, this comment: “Proposer is trying to show how to automate one of the problems in the domain of monitoring. He is aiming for the “how” with a bounded solution.”
    2. Therefore, have a slide showing merits of dashboards as code. Drive home the point that you don’t need to point and click to generate visualizations.
    3. Moving to treating dashboards as code is a cultural change. Show the before and after scenario, and how the team adapted to this cultural change.
  • Zainab Bawa (@zainabbawa) Reviewer 4 months ago

    Specifically on the slides:

    1. Remove the slide on Terraform as the second top tech skill of 2018. This is not so much a talk about Terraform, as it is about dashboards as code.
    • Sanooj Mananghat (@sanoojm) Proposer 4 months ago

      Thanks for your feedback. I am reworking those slides. Will get back ASAP.

    • Sanooj Mananghat (@sanoojm) Proposer 4 months ago

      Hi Zainab,

      I have updated the slides. Please let me know your suggestions.

      Thanks,
      Sanooj M

  • Zainab Bawa (@zainabbawa) Reviewer 4 months ago

    The revised slides look better. But the story is still not coming through because there are few details on the “during” and “after” parts of your journey.

    The before part also seems to be based on some general principles rather than the story of your team. What am I missing?

  • Anwesha Sarkar (@anweshaalt) Reviewer a month ago

    The feedback from today’s rehearsal:

    Time taken: 12 minutes.

    1. Include more tech details.
    2. What is the take away?
    3. What is the focus of the talk?
    4. Include a conclusion.
    5. Include problem statements.
    6. Include war stories.
    7. Include the state
    8. How did your life change before and after using this tool?
    9. What are the problems you have faced after you have implemented this one?
    10. Include the contact credentials at the end slide.
    11. External Consumer slide : what is the relatability of the model in this situation? The pictorial representation has to be changed.
    12. Include a demo.
    13. Who are your target audience for this talk?
    14. Do you have a community? How big is the community?

    Submit your revised slides by Monday the 9th.

    Regards
    Anwesha

    • Sanooj Mananghat (@sanoojm) Proposer a month ago

      Anwesha,
      I am working on this. Will get back by 12th.
      Thanks,
      Sanooj M

      • Sanooj Mananghat (@sanoojm) Proposer 18 days ago

        Anwesha,
        I have updated the slides. Please review.

        Thanks,
        Sanooj M

  • Zainab Bawa (@zainabbawa) Reviewer 4 months ago (edited 4 months ago)

    Thank you for the detailed response, Sanooj. What is missing here are the details of the approach and implementation which you have to show us in your slides in order to fully evaluate your proposal. Note that participants at Rootconf are practitioners, and they need details to fully appreciate the problem and to evaluate whether your approach is something they can consider for their use case.

    Upload your detailed slides by 7 June, so that we can evaluate and let you know our decision.

    • Sanooj Mananghat (@sanoojm) Proposer 4 months ago

      Hi Zainab,

      Thanks for your quick response. I have uploaded the slides. Please review and let me know your suggestions. I will be working on improving the slides.
      Thanks,
      Sanooj M
