Lessons learned self-managing a terabyte-scale database for three nines of uptime
Keeping stateless systems up is relatively easy: you can scale horizontally or vertically by throwing more hardware at them. Keeping stateful systems such as databases reliable and available is much harder. This talk covers lessons learned managing production databases with terabytes of data to achieve three nines of uptime.
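For context, "three nines" leaves a surprisingly small downtime budget. A quick back-of-the-envelope calculation (the arithmetic below is mine, not a figure from the talk):

```python
# Yearly downtime budget allowed by a given availability target.
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_budget_hours(availability: float) -> float:
    """Hours of downtime per year permitted at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR

print(f"{downtime_budget_hours(0.999):.2f} h/year")   # three nines: ~8.76 hours
print(f"{downtime_budget_hours(0.9999):.2f} h/year")  # four nines: ~0.88 hours
```

In other words, three nines allows well under an hour of unplanned downtime per month, which is tight for any system that needs failovers, schema migrations, and disk resizes.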
I had to manage the uptime and scalability of a self-managed production MySQL cluster for a fintech company (Flip.id). The transactional MySQL cluster was 1.5 TB in size and grew by 6 GB per day. There were six database nodes, each with 32 vCPUs, 128 GB of RAM, and 2 TB of disk.
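Those numbers imply a concrete capacity runway. A rough sketch (my own arithmetic, assuming growth stays linear and ignoring binlogs, indexes, and replication overhead):

```python
# Rough per-node disk runway at a linear growth rate
# (all figures taken from the description above).
disk_tb = 2.0           # per-node disk capacity
used_tb = 1.5           # current data size
growth_gb_per_day = 6   # daily growth

free_gb = (disk_tb - used_tb) * 1000       # ~500 GB of headroom
runway_days = free_gb / growth_gb_per_day  # ~83 days before the disk fills
print(f"~{runway_days:.0f} days of headroom")
```

Under those assumptions each node has roughly a quarter of a year before its disk fills, which is why disk scalability makes the challenge list below.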
Some of the challenges I had to solve:
- Observability and uptime monitoring
- Scalability (disks, compute, etc.)
- Read-write traffic routing across various DB nodes
- Schema migrations
- Managing database security
- Controlling replication lag
- Making data available for analytics use cases
- Handling numerous production incidents related to database uptime and performance
Each of the bullet points above is worthy of a talk in itself, but I’ll cover all of these challenges and how we solved them in our case. We started with very limited observability and gradually worked our way up to three nines of uptime for the database. We eventually moved from the self-managed cluster to a managed cloud service (GCP’s CloudSQL), but that story is for another time. 😀
You’ll learn tools and patterns that you can apply in your own work if you’re managing any stateful system (databases, queues, etc.).
You’ll learn the importance of:
- System decoupling (when I cover ProxySQL and how it helped us decouple components)
- The power of operationally simple tools (gh-ost and how it simplified schema migrations for us)
- Dev-prod parity, and testing your approaches at production scale in your staging environment
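To make the ProxySQL point concrete: its query rules let you match statements with regexes and send them to different backend hostgroups, which is how reads get fanned out to replicas while writes stay on the primary. A toy Python sketch of that read/write-split idea (the hostgroup names and rules here are illustrative, not actual ProxySQL configuration):

```python
import re

# Illustrative hostgroups, mirroring a typical read/write split:
# writes go to the primary, reads fan out across replicas.
WRITER_HOSTGROUP = ["primary"]
READER_HOSTGROUP = ["replica-1", "replica-2"]

# ProxySQL-style rules: statements matching ^SELECT go to the readers,
# except SELECT ... FOR UPDATE, which takes locks and must see the primary.
READ_RULE = re.compile(r"^\s*SELECT", re.IGNORECASE)
FOR_UPDATE = re.compile(r"FOR\s+UPDATE\s*;?\s*$", re.IGNORECASE)

def route(query: str) -> list[str]:
    """Return the hostgroup a statement would be routed to."""
    if READ_RULE.search(query) and not FOR_UPDATE.search(query):
        return READER_HOSTGROUP
    return WRITER_HOSTGROUP

print(route("SELECT * FROM users"))             # replicas
print(route("UPDATE users SET name = 'x'"))     # primary
print(route("SELECT * FROM users FOR UPDATE"))  # primary
```

The decoupling win is that the application only ever talks to one endpoint; replica failover, lag-based exclusion, and traffic shaping all happen behind it.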
The talk will lean on diagrams and storytelling rather than bullet-point slides for you to read. I recently spoke at RubyConf India about managing the trade-off between over-engineering and the big ball of mud when building software systems; I am attaching the video of that talk to give the review committee an idea of how I speak.