Review date and time - 4 February 2025, 6 PM - 7 PM
Presenter - Raj Suvariya
Reviewers - Alok Singh, Pramod Biligiri
Link to slides: https://docs.google.com/presentation/d/1VkEOI1UK--WsTKeXscaY00YS_8nmmEQz/edit?usp=sharing&ouid=101938961187815333014&rtpof=true&sd=true
Note: Raj will be presenting this talk at the Hyderabad meetup organized by the FOSS on Saturday, 8 February
- The speaker asked the audience about their SQL cluster management experience.
- Common challenges mentioned:
- Schema changes
- General operations
- Backup & restore
- Ensuring system reliability
- Challenges were categorized into operations and maintenance, highlighting difficulties in non-serverless environments.
- The talk focused on automating TiDB planned maintenance using Kubernetes at Flipkart.
- Overview of TiDB architecture:
- TiDB – Stateless SQL layer, processes queries.
- PD (Placement Driver) – Manages metadata, topology, and scheduling.
- TiKV – Core data store, uses RocksDB for storage and write-ahead logs.
- TiFlash – Optimized for analytical queries.
- TiCDC – Manages change data capture.
- TiKV and PD require high availability to prevent data loss and cluster disruptions.
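A minimal sketch of the availability arithmetic behind planned maintenance follows. It is background, not something shown in the talk: TiKV and PD replicate state with Raft, so a group stays writable only while a majority of its replicas is up. The function names and the replica counts in the examples are illustrative assumptions.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of a Raft group with `replicas` members."""
    return replicas // 2 + 1


def safe_to_take_down(replicas: int, already_down: int) -> bool:
    """True if one more replica can go offline while a majority stays up."""
    remaining = replicas - already_down - 1
    return remaining >= quorum(replicas)


if __name__ == "__main__":
    # (total replicas, replicas already offline) are assumed example values
    for total, down in [(3, 0), (3, 1), (5, 1)]:
        print(total, down, safe_to_take_down(total, down))
```

With three replicas, only one node can be under maintenance at a time, which is one reason maintenance on such a cluster has to be sequenced carefully.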
- Flipkart deployed TiDB on Kubernetes, leveraging:
- Mixed hardware storage
- Local PV (Persistent Volumes) for performance optimization.
- Limitation: no zero-downtime patching.
- Manual maintenance drawbacks:
- On-call burden
- Risk of human error
- Frequent scheduled maintenance needs
- Automation solution explored to reduce operational overhead.
- Chose a k8s operator-based approach for:
- Native Kubernetes integration
- Control loop mechanisms
- Fault tolerance & high availability
- Extensibility
- Implemented simple Custom Resource Definitions (CRDs)
- Automated manual steps with reconciliation loops.
- Lessons from automation:
- Queue management to avoid multiple reconciliations for the same resource.
- Tuning the resync period to account for eventual-consistency lag.
- Ensuring idempotency and lightweight operations in the reconciler.
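The three lessons above can be illustrated with a small sketch. It is not the operator presented in the talk: `MaintenanceRequest`, `DedupQueue`, `reconcile`, `resync`, and the `"maintain ..."` action string are all hypothetical names chosen for illustration, and the language is Python rather than the Go an actual operator would likely use. The queue keeps each pending key only once, the resync re-enqueues every known resource, and the reconciler does nothing when the observed state already matches the desired state.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class MaintenanceRequest:
    """Hypothetical custom resource; the field names are illustrative."""
    name: str
    pvc_name: str
    done: bool = False  # observed status: has the maintenance been applied?


class DedupQueue:
    """FIFO queue that holds each key at most once while it is pending."""

    def __init__(self) -> None:
        self._order: deque = deque()
        self._pending: set = set()

    def add(self, key: str) -> bool:
        if key in self._pending:
            return False  # already queued; drop the duplicate
        self._pending.add(key)
        self._order.append(key)
        return True

    def pop(self) -> Optional[str]:
        if not self._order:
            return None
        key = self._order.popleft()
        self._pending.discard(key)
        return key

    def __len__(self) -> int:
        return len(self._order)


def reconcile(req: MaintenanceRequest, actions: list) -> None:
    """Idempotent: repeated calls perform the action at most once."""
    if req.done:
        return  # nothing to do, state already converged
    actions.append(f"maintain {req.pvc_name}")  # placeholder for real work
    req.done = True


def resync(queue: DedupQueue, store: dict) -> None:
    """Periodic resync: re-enqueue every known resource key."""
    for key in store:
        queue.add(key)


def run(queue: DedupQueue, store: dict, actions: list) -> None:
    """Drain the queue, reconciling each resource once per queued key."""
    while (key := queue.pop()) is not None:
        reconcile(store[key], actions)
```

Because `reconcile` is idempotent, a short resync period costs only cheap no-op passes, while a long one delays recovery from a missed event; that trade-off is what tuning the resync period balances.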
- Impact:
- Significant time savings.
- Zero incidents during automated maintenance.
- Total fleet size is ~1 PB
- The fleet was growing exponentially, driven by proven onboarding processes.
- Managed ~100 database clusters.
- Largest cluster was around 30-40 TB.
- Expected TiDB to behave like a SQL database.
- The experience resonated more with managing Redis clusters.
- Asked if TiDB was used for short-term storage or long-term persistence.
- Answer: TiDB was used for long-term storage.
- Suggested including scale metrics in the presentation.
- Would help audience better understand the operational scope.
- Asked if the official TiDB operator was extended or if a new operator was developed.
- Answer: A new operator was created, with plans to contribute back to the official one.
- Questioned if managing two operators was an anti-pattern.
- Answer: Not an issue, as their operator handles infrastructure-level tasks without conflicting with TiDB’s internal state, and its actions are idempotent.
- Noted an impedance mismatch between Kubernetes’ operating model and Persistent Volumes (PVs).
- Asked why local PVs were chosen with k8s despite these challenges.
- Answer: Local PVs provided higher performance, which network-attached storage couldn’t match.
- Despite affinity challenges, the team prioritized performance over convenience.
- The way questions were framed felt more like a test than an attempt to engage the audience.
- Code snippets were hard to follow.
- Suggested simplifying code snippets to improve clarity.
- Even with Go and Kubernetes operator experience, it was difficult to follow the arguments made in the code.
- The problem statement should be introduced much earlier.
- Highlight the value proposition earlier to keep the audience engaged.
- Most attendees may not be familiar with PIDB + TiDB + Kubernetes, but they will appreciate automation benefits.
- Mention scale (if allowed) using rough figures like terabytes or petabytes.
- The first 5-6 slides were too detailed; they should be more concise.
- Too many bullet points made the beginning feel like a paper summary rather than an experience report.
- Move the “TiDB Query Processing” diagram earlier as it explains components and control flow well.
- Consider using an actual query example in the flow diagram.
- Reduce the focus on TiDB details initially, as it delayed discussing the actual solution.
- Remove or shorten the “OLAP” concept and TiFlash explanation unless necessary, as they don’t contribute to the core problem.
- Consistently refer to “Placement Driver” instead of “PD” to avoid confusion.
- If applicable, add a dashboard management UI diagram.
- Simplify the TiKV slide:
- Use 3 nodes instead of 4 for clarity.
- Replace generic color-coded “region” labels with a real-world example (e.g., an employee record).
- Clarify whether the runbook for placement driver recreation was built in-house or taken from TiDB documentation.
- If parts were self-developed, explicitly call them out.
- The “Key Learnings” slide should be split into two for better readability.
- Expand abbreviations like PV, PVC, JBOD to improve accessibility.
- Highlight custom fields in the CRD slide (e.g., pvc name, schedule time).
- Clearly differentiate Kubernetes Jobs vs. Kubernetes Operators in the alternatives section.
- Specify that the “Cloud” reference pertains to Flipkart’s private cloud, not public cloud providers.
- Provide a “References” slide before the “Thank You” slide with relevant links.
- Introduce the solution within the first 10 minutes; too much time was spent explaining TiDB concepts.
- Have a mental cutoff point to transition into the solution.
- Trim down initial technical details and bullet points.
- Structure slides to build engagement early.
- Ensure clarity in terminology and diagrams.
- Provide useful references for further reading.
- The feedback mainly focused on presentation clarity rather than technical content.
- Overall, improve flow, conciseness, and audience engagement.