Rootconf 2017

On service reliability


Adventures in Postgres management

Submitted by Ramanan Balakrishnan (@ramananbalakrishnan) on Saturday, 14 January 2017

Preview video

Section: Full talk of 40 mins duration Technical level: Intermediate

View proposal in schedule


The talk will cover deployment, maintenance and monitoring strategies for large-scale Postgres instances. Starting with a story about unexpected performance issues, the journey will continue down the rabbit hole of database management. Concurrency control solutions in Postgres, their benefits, drawbacks and consequences will be discussed - via a post-mortem analysis of an (almost-catastrophic) incident. Structured as a series of investigation-action-learning cycles - the aim is to convey the inquisitive mindset required of DevOps engineers to maintain high-reliability systems.


Introduction and setup [~ 5 mins]

The aim of this section is to provide a gentle introduction to the system being examined. The use-case, requirements and setup of the tools will be discussed to benefit those not familiar with the intricacies of the domain.

topics covered:

  • basic overview of Postgres/SQL database management
  • solutions for concurrency control via transaction IDs
  • caveats to these solutions and its consequences

Deep dive into the story [~ 20 mins]

This section will form the crux of the talk. Beginning with investigations into reports of degraded performance, the story leads towards specific limitations of Postgres’s MVCC implementation.

Tools used in investigating the incident and various attempted solutions will be discussed. The focus will be on diagnosing/fixing systems without compromising on underlying performance and uptime requirements.

topics covered:

  • overview of use-case, system setup and configuration specifics
  • initial monitoring and tracking system performance reports
  • tools used to investigate the degradation
  • first attempt at a fix, its failure to correct, and subsequent learning
  • second attempt to brute-force a solution, and the resulting (failed) inconsistent state
  • final attempt at a simpler solution, and learning

Lessons, Upgrades and Conclusion [~ 10 mins]

The last section will cover lessons learned - including checks and systems that could have avoided the problem. The various knobs available to DevOps engineers to manage the performance of their systems will be explored. Combining personal experiences with industry best practices, the aim is to deliver concrete steps to simplify database management. Also, in the months since the incident, patches in newer Postgres releases which fix the problem will be listed.

topics covered:

  • knobs to turns and levers to pull when tuning systems
  • best practices to simplify Postgres management roles
  • updates and relevant patches available via newer releases


Even though the session is aimed at the novice DevOps engineer, a general understanding of relational databases and their performance requirements/constraints would help.

Speaker bio

I am a member of the data science team at Semantics3 - building data-powered software for ecommerce-focused companies. Over the years, I have had the chance to dabble in various fields covering data processing, pipeline setup, database management and data science. When I am not picking locks, or scuba diving, I usually blog about my technical adventures at our team’s engineering blog.



Preview video


  • Philip Paeps (@trouble) 2 years ago

    This is an excellent proposal!

    I like the topics you aim to cover and agree that they would be appropriate for Rootconf. Since it is safe to assume that a Rootconf audience will be at least somewhat aware of PostgreSQL and RDBMSes in general, you could keep your introduction shorter (~5 minutes) and give yourself more time for the main part of your presentation.

    • Ramanan Balakrishnan (@ramananbalakrishnan) Proposer 2 years ago (edited 2 years ago)

      Got it, thanks for the feedback!

      Yes, I agree that shortening the introduction, given the expected audience profile, could free up significant time for later sections.

      I am still thinking of ways to refine the segments and will continue updating the proposal over the next few days.

  • Yagnik (@yagnik) 2 years ago

    Looking forward to this!!

Login with Twitter or Google to leave a comment