Adventures in Postgres management
The talk will cover deployment, maintenance and monitoring strategies for large-scale Postgres instances. Starting with a story about unexpected performance issues, the journey will continue down the rabbit hole of database management. Concurrency control in Postgres - its mechanisms, benefits, drawbacks and consequences - will be discussed via a post-mortem analysis of an (almost-catastrophic) incident. Structured as a series of investigation-action-learning cycles, the talk aims to convey the inquisitive mindset DevOps engineers need to maintain high-reliability systems.
Introduction and setup [~ 5 mins]
The aim of this section is to provide a gentle introduction to the system being examined. The use-case, requirements and setup of the tools will be discussed to benefit those not familiar with the intricacies of the domain.
- basic overview of Postgres/SQL database management
- solutions for concurrency control via transaction IDs
- caveats to these solutions and their consequences
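The caveat above stems from Postgres identifying transactions with a 32-bit counter, so "older than" must be defined circularly. A minimal Python sketch of that comparison (illustrative only, not the actual C implementation) shows why a table can only reference roughly 2 billion transactions into the past before old rows would suddenly appear to be "in the future":

```python
# Illustrative sketch of 32-bit transaction ID (txid) comparison.
# With only 32 bits, txids wrap around; id1 "precedes" id2 when the
# signed 32-bit difference (id1 - id2) is negative.

TXID_MODULUS = 2**32

def txid_precedes(id1: int, id2: int) -> bool:
    """True if id1 is 'in the past' relative to id2, modulo 2^32."""
    diff = (id1 - id2) % TXID_MODULUS
    # Interpret as signed 32-bit: values >= 2^31 count as negative.
    return diff >= 2**31

# A txid just behind the wraparound point still counts as older:
assert txid_precedes(TXID_MODULUS - 1, 1)
assert txid_precedes(100, 200)
assert not txid_precedes(200, 100)
```

This circular ordering is why rows must periodically be "frozen" by vacuum; neglecting that maintenance is the road to the wraparound trouble the talk explores.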
Deep dive into the story [~ 20 mins]
This section will form the crux of the talk. Beginning with investigations into reports of degraded performance, the story leads towards specific limitations of Postgres’s MVCC implementation.
Tools used in investigating the incident and various attempted solutions will be discussed. The focus will be on diagnosing/fixing systems without compromising on underlying performance and uptime requirements.
- overview of use-case, system setup and configuration specifics
- initial monitoring and tracking system performance reports
- tools used to investigate the degradation
- first attempt at a fix, its failure to correct the issue, and the subsequent learning
- second attempt to brute-force a solution, and the inconsistent state that resulted
- final attempt at a simpler solution, and the lessons learned
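One of the monitoring checks the investigation leans on can be sketched as follows. The SQL query reports how far each database's oldest unfrozen transaction ID lags behind the current one; the alerting threshold shown is an assumed example, not a prescribed value, and the database connection itself (e.g. via a driver such as psycopg2) is omitted:

```python
# Query the age of the oldest unfrozen txid per database.
WRAPAROUND_QUERY = """
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
"""

# Postgres refuses new transactions well before the 2^31 ceiling, so
# alerting should fire much earlier. The 75% fraction here is an
# assumed, illustrative threshold.
XID_LIMIT = 2**31
ALERT_THRESHOLD = int(XID_LIMIT * 0.75)

def needs_attention(xid_age: int) -> bool:
    """Flag databases whose txid age is approaching the wraparound limit."""
    return xid_age > ALERT_THRESHOLD
```

Tracking this number over time, rather than discovering it during an incident, is one of the cheap checks the talk advocates.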
Lessons, Upgrades and Conclusion [~ 10 mins]
The last section will cover lessons learned - including checks and systems that could have prevented the problem. The various knobs available to DevOps engineers for managing the performance of their systems will be explored. Combining personal experience with industry best practices, the aim is to deliver concrete steps to simplify database management. Finally, patches in newer Postgres releases - shipped in the months since the incident - that fix the problem will be listed.
- knobs to turn and levers to pull when tuning systems
- best practices to simplify Postgres management roles
- updates and relevant patches available via newer releases
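To make one such knob concrete: autovacuum triggers a VACUUM on a table once its dead-tuple count passes a threshold derived from two settings. The defaults below (50 and 0.2) match stock Postgres; the table sizes are made-up illustrative numbers:

```python
def autovacuum_trigger_point(n_live_tuples: int,
                             vacuum_threshold: int = 50,
                             scale_factor: float = 0.2) -> float:
    """Dead-tuple count at which autovacuum will vacuum the table."""
    return vacuum_threshold + scale_factor * n_live_tuples

# For a 10-million-row table, the defaults wait for ~2M dead tuples,
# often far too lazy for large, busy tables:
assert autovacuum_trigger_point(10_000_000) == 2_000_050
# Lowering the scale factor per-table makes vacuuming much more eager:
assert autovacuum_trigger_point(10_000_000, scale_factor=0.01) == 100_050
```

Tuning these per-table, rather than relying on cluster-wide defaults, is one of the concrete steps the section will walk through.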
Even though the session is aimed at the novice DevOps engineer, a general understanding of relational databases and their performance requirements/constraints would help.
I am a member of the data science team at Semantics3 - building data-powered software for ecommerce-focused companies. Over the years, I have had the chance to dabble in various fields covering data processing, pipeline setup, database management and data science. When I am not picking locks or scuba diving, I usually blog about my technical adventures at our team’s engineering blog.
- Original incident writeup - https://engineering.semantics3.com/2016/07/20/an-unexpected-dba-journey/
- Some interesting discussion that followed at https://news.ycombinator.com/item?id=12209395 and https://www.reddit.com/r/programming/comments/4vr00u/an_unexpected_journey_a_postgres_dbas_tale/
- My other technical articles - https://engineering.semantics3.com/authors/ramananbalakrishnan/