Rootconf 2017

On service reliability

Adventures in Postgres management

Submitted by Ramanan Balakrishnan (@ramananbalakrishnan) on Jan 14, 2017

Section: Full talk of 40 mins duration
Technical level: Intermediate
Status: Confirmed & Scheduled

Abstract

The talk will cover deployment, maintenance and monitoring strategies for large-scale Postgres instances. Starting with a story about unexpected performance issues, the journey continues down the rabbit hole of database management. Concurrency control in Postgres, along with its benefits, drawbacks and consequences, will be discussed through a post-mortem analysis of an (almost catastrophic) incident. Structured as a series of investigation-action-learning cycles, the talk aims to convey the inquisitive mindset DevOps engineers need in order to maintain high-reliability systems.

Outline

Introduction and setup [~ 5 mins]

The aim of this section is to provide a gentle introduction to the system being examined. The use-case, requirements and setup of the tools will be discussed to benefit those not familiar with the intricacies of the domain.

topics covered:

  • basic overview of Postgres/SQL database management
  • solutions for concurrency control via transaction IDs
  • caveats to these solutions and their consequences
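
As a minimal illustration of the transaction-ID machinery (and its best-known caveat, transaction ID wraparound), queries along these lines show the counters involved; the exact thresholds below are illustrative, not prescriptive:

```sql
-- Current transaction ID; advances with every write transaction
SELECT txid_current();

-- Age of each database's oldest unfrozen transaction ID.
-- As this approaches ~2 billion, Postgres risks wraparound and
-- will eventually refuse new writes until a vacuum freezes old rows.
SELECT datname, age(datfrozenxid)
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
```

Because transaction IDs are a finite 32-bit counter, routine (auto)vacuuming is what keeps these ages bounded — which is exactly where the caveats discussed in the talk arise.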

Deep dive into the story [~ 20 mins]

This section will form the crux of the talk. Beginning with investigations into reports of degraded performance, the story leads towards specific limitations of Postgres’s MVCC implementation.

Tools used in investigating the incident and various attempted solutions will be discussed. The focus will be on diagnosing/fixing systems without compromising on underlying performance and uptime requirements.

topics covered:

  • overview of use-case, system setup and configuration specifics
  • initial monitoring and tracking system performance reports
  • tools used to investigate the degradation
  • first attempt at a fix, why it failed to correct the problem, and the resulting learning
  • second attempt to brute-force a solution, and the inconsistent state it left behind
  • final attempt at a simpler solution, and the lessons learned
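
To give a flavour of the kind of investigation involved (a sketch, not the specific queries from the incident), Postgres's statistics views can surface tables that vacuum is falling behind on and the transactions holding it back:

```sql
-- Tables with the most dead tuples, and when autovacuum last ran on them
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Long-running transactions, which can prevent vacuum from
-- reclaiming dead rows anywhere in the database
SELECT pid, state, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```

Views like these are read-only, so they can be polled on a live system without compromising the uptime requirements the talk emphasises.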

Lessons, Upgrades and Conclusion [~ 10 mins]

The last section will cover lessons learned, including checks and systems that could have prevented the problem. The various knobs available to DevOps engineers for managing the performance of their systems will be explored. Combining personal experience with industry best practices, the aim is to deliver concrete steps that simplify database management. Finally, patches in newer Postgres releases (published in the months since the incident) that fix the problem will be listed.

topics covered:

  • knobs to turn and levers to pull when tuning systems
  • best practices to simplify Postgres management roles
  • updates and relevant patches available via newer releases
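
For concreteness, the "knobs" in question are largely vacuum-related settings in postgresql.conf. The values below are an illustrative sketch only, not recommendations from the talk:

```
# Illustrative autovacuum-related settings (values are examples, not advice)
autovacuum = on
autovacuum_freeze_max_age = 200000000   # when anti-wraparound vacuums are forced
autovacuum_vacuum_cost_limit = 2000     # how much work vacuum may do per cycle
maintenance_work_mem = 1GB              # memory available to VACUUM
log_autovacuum_min_duration = 0         # log every autovacuum run, for visibility
```

Each setting trades vacuum throughput against foreground query performance, which is why tuning them is framed as a reliability exercise rather than a one-off configuration task.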

Requirements

Even though the session is aimed at the novice DevOps engineer, a general understanding of relational databases and their performance requirements/constraints would help.

Speaker bio

I am a member of the data science team at Semantics3 - building data-powered software for ecommerce-focused companies. Over the years, I have had the chance to dabble in various fields covering data processing, pipeline setup, database management and data science. When I am not picking locks, or scuba diving, I usually blog about my technical adventures at our team’s engineering blog.

Links

Slides

https://docs.google.com/presentation/d/1Uf2t4f_MmDqth5X21e9CsXweYe63_5Y7fbg4Rh2Gefk/pub

Preview video

https://www.youtube.com/watch?v=-nv9pGBHlnQ
