Lessons learned while self-managing a terabyte-scale database for three nines of uptime
Managing uptime for stateless systems is relatively easy: you can scale horizontally or vertically by adding more hardware. Keeping stateful systems such as databases up and reliable is much harder. This talk covers lessons learned from managing production databases holding terabytes of data while achieving three nines of uptime.
I was responsible for the uptime and scalability of the self-managed production MySQL cluster at a fintech company (Flip.id). The transactional MySQL cluster held 1.5 TB of data and grew by 6 GB per day. There were six database nodes, each with 32 vCPUs, 128 GB of RAM, and 2 TB of disk.
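To put those numbers in perspective, here is a rough back-of-the-envelope sketch of the downtime budget that three nines allows and the disk headroom implied by the figures above. It assumes decimal units (1 TB = 1000 GB), linear growth, and that every node holds a full copy of the data on its 2 TB disk; those assumptions and the function names are mine, for illustration only.

```python
def downtime_budget_minutes(availability: float, period_days: float) -> float:
    """Minutes of allowed downtime over a period for a given availability target."""
    return (1.0 - availability) * period_days * 24 * 60


def days_until_full(disk_gb: float, used_gb: float, growth_gb_per_day: float) -> float:
    """Days until the disk fills up, assuming constant linear growth (an assumption)."""
    return (disk_gb - used_gb) / growth_gb_per_day


if __name__ == "__main__":
    # Three nines (99.9%) over a 365-day year: about 8.76 hours of downtime.
    yearly = downtime_budget_minutes(0.999, 365)
    print(f"Yearly downtime budget: {yearly / 60:.2f} hours")

    # The same target over a 30-day month: about 43 minutes.
    monthly = downtime_budget_minutes(0.999, 30)
    print(f"Monthly downtime budget: {monthly:.1f} minutes")

    # 2 TB disk, 1.5 TB used, 6 GB/day growth: roughly 83 days of headroom.
    runway = days_until_full(disk_gb=2000, used_gb=1500, growth_gb_per_day=6)
    print(f"Disk runway: {runway:.0f} days")
```

Under these assumptions a single long incident can consume a whole month's budget, and disk capacity needs attention within a quarter.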
Some of the challenges I had to solve:
- Observability and uptime monitoring
- Scalability (disks, compute, etc.)
- Read-write traffic routing across various DB nodes
- Schema migrations
- Managing database security
- Controlling replication lag
- Making data available for analytics use cases
- Numerous production incidents related to database uptime and performance
Each of these bullet points is worthy of a talk in itself. However, I’ll cover all of them and how we solved them in our case. We started with very limited observability and gradually worked our way to three nines of uptime for the database. We eventually moved from a self-managed cluster to a managed cloud service (GCP’s Cloud SQL), but that story is for another time. 😀
You’ll learn tools and patterns that you can apply in your own work if you’re managing any stateful system (databases, queues, etc.).
You’ll learn the importance of:
- System decoupling (ProxySQL and how it helped us decouple components)
- Operationally simple tools (gh-ost and how it simplified schema migrations for us)
- Dev-prod parity and testing your approaches at production scale in your staging environment
There will be a lot of diagrams and storytelling instead of just bullet-point slides for you to read. I recently spoke at RubyConf India about managing the trade-offs between over-engineering and the big ball of mud when building software systems. I am attaching the video of that talk to give the review committee an idea of how I speak.