Pramod Biligiri (@pramodbiligiri), Reviewer

Alok G Singh (@al0k), Reviewer

Raj Suvariya (@rajsuvariya), Presenter

Summary of TiDB automation talk at Flipkart

Submitted Feb 5, 2025

Review date and time - 4 February 2025, 6 PM - 7 PM
Presenter - Raj Suvariya
Reviewers - Alok Singh, Pramod Biligiri
Link to slides: https://docs.google.com/presentation/d/1VkEOI1UK--WsTKeXscaY00YS_8nmmEQz/edit?usp=sharing&ouid=101938961187815333014&rtpof=true&sd=true

Note: Raj will be presenting this talk at the Hyderabad meetup organized by the FOSS on Saturday, 8 February


Presentation summary

SQL cluster management challenges

  • The speaker asked the audience about their SQL cluster management experience.
  • Common challenges mentioned:
    • Schema changes
    • General operations
    • Backup & restore
    • Ensuring system reliability
  • Challenges categorized into operations and maintenance, highlighting difficulties in non-serverless environments.

TiDB and its components

  • The talk focused on automating TiDB planned maintenance using Kubernetes at Flipkart.
  • Overview of TiDB architecture:
    • TiDB – Stateless SQL layer, processes queries.
    • PD (Placement Driver) – Manages metadata, topology, and scheduling.
    • TiKV – Core data store, uses RocksDB for storage and write-ahead logs.
    • TiFlash – Optimized for analytical queries.
    • TiCDC – Manages change data capture.

High availability & infrastructure at Flipkart

  • TiKV and PD require high availability to prevent data loss and cluster disruptions.
  • Flipkart deployed TiDB on Kubernetes. Characteristics of the setup:
    • Mixed hardware storage
    • Local PVs (Persistent Volumes) for performance
    • No zero-downtime patching

Challenges leading to automation

  • Manual maintenance drawbacks:
    • On-call burden
    • Risk of human error
    • Frequent scheduled maintenance needs
  • Automation solution explored to reduce operational overhead.

Operator-based automation on Kubernetes

  • Chose a Kubernetes operator-based approach for:
    • Native Kubernetes integration
    • Control loop mechanisms
    • Fault tolerance & high availability
    • Extensibility
  • Implemented simple Custom Resource Definitions (CRDs)
  • Automated manual steps with reconciliation loops.
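
To make the reconciliation idea above concrete, here is a minimal Python sketch of a control loop that advances one maintenance task a step at a time. It is an illustration only, not the operator presented in the talk: the `MaintenanceRequest` type, the phase names, and the `reconcile` function are all hypothetical, and only the `pvc_name` and `schedule_time` fields echo the custom fields mentioned in the review.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Phase(Enum):
    PENDING = "Pending"
    PREPARING = "Preparing"      # hypothetical step, e.g. moving load away
    MAINTAINING = "Maintaining"  # hypothetical step, e.g. patching the node
    DONE = "Done"


@dataclass
class MaintenanceRequest:
    """Hypothetical custom resource describing one planned-maintenance task."""
    pvc_name: str                 # spec: the volume whose store is affected
    schedule_time: datetime       # spec: earliest time the work may start
    phase: Phase = Phase.PENDING  # status: last recorded phase
    history: list = field(default_factory=list)


# Assumed ordering of the steps that used to be done by hand.
NEXT_PHASE = {
    Phase.PENDING: Phase.PREPARING,
    Phase.PREPARING: Phase.MAINTAINING,
    Phase.MAINTAINING: Phase.DONE,
}


def reconcile(req: MaintenanceRequest, now: datetime) -> bool:
    """Run one pass of the loop. Returns True if the request needs a requeue."""
    if req.phase is Phase.DONE:
        return False  # already converged; repeating the call changes nothing
    if now < req.schedule_time:
        return True   # not due yet; look again on a later pass
    req.phase = NEXT_PHASE[req.phase]
    req.history.append(req.phase.value)
    return req.phase is not Phase.DONE


if __name__ == "__main__":
    due = datetime(2025, 2, 8, 2, 0, tzinfo=timezone.utc)
    request = MaintenanceRequest(pvc_name="tikv-data-0", schedule_time=due)
    now = datetime(2025, 2, 8, 3, 0, tzinfo=timezone.utc)
    while reconcile(request, now):
        pass
    print(request.phase.value, request.history)
```

Each pass reads the recorded phase and performs at most one step, so a crash between passes loses no progress: the next pass resumes from whatever phase was last recorded.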

Key learnings & results

  • Lessons from automation:
    • Queue management to avoid multiple reconciliations for the same resource.
    • Tuning the resync period to account for eventual-consistency lag.
    • Ensuring idempotency and lightweight operations in the reconciler.
  • Impact:
    • Significant time savings.
    • Zero incidents during automated maintenance.
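
The queue-management and idempotency lessons above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the code from the talk: `DedupQueue` and `ensure_label` are hypothetical names, and a production work queue (such as the one in client-go) also tracks items that are currently being processed, which this sketch leaves out.

```python
from collections import deque


class DedupQueue:
    """FIFO queue that holds each key at most once while it is pending."""

    def __init__(self):
        self._order = deque()
        self._pending = set()

    def add(self, key: str) -> bool:
        """Enqueue key; return False if it was already waiting."""
        if key in self._pending:
            return False  # coalesce duplicate events for the same resource
        self._pending.add(key)
        self._order.append(key)
        return True

    def pop(self) -> str:
        """Remove and return the oldest pending key."""
        key = self._order.popleft()
        self._pending.discard(key)
        return key

    def __len__(self) -> int:
        return len(self._order)


def ensure_label(labels: dict, key: str, value: str) -> bool:
    """Idempotent action: write only when needed; return True if changed."""
    if labels.get(key) == value:
        return False  # desired state already holds, so do nothing
    labels[key] = value
    return True


if __name__ == "__main__":
    queue = DedupQueue()
    for event in ["cluster-a/tikv-0", "cluster-a/tikv-0", "cluster-a/tikv-1"]:
        queue.add(event)
    print(len(queue))  # two distinct keys are pending

    labels = {}
    print(ensure_label(labels, "maintenance", "true"))  # first call changes state
    print(ensure_label(labels, "maintenance", "true"))  # repeat is a no-op
```

Coalescing events by key keeps a burst of updates from triggering one reconciliation per update, and writing each action as "make it so" rather than "do it again" means a repeated or retried reconciliation is harmless.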

Alok’s feedback

Scale of operations

  • Total fleet size was ~1 PB.
  • The system was growing exponentially, helped by proven onboarding processes.
  • The team managed ~100 database clusters.
  • The largest cluster was ~30-40 TB.

Key feedback points

1. Expected use case vs. reality

  • Expected TiDB to behave like a SQL database.
  • The experience described felt closer to managing Redis clusters.
  • Asked if TiDB was used for short-term storage or long-term persistence.
    • Answer: TiDB was used for long-term storage.

2. Missing context on scale

  • Suggested including scale metrics in the presentation.
  • Would help audience better understand the operational scope.

3. Operator implementation

  • Asked if the official TiDB operator was extended or if a new operator was developed.
    • Answer: A new operator was created, with plans to contribute back to the official one.
  • Questioned if managing two operators was an anti-pattern.
    • Answer: Not an issue, as they handled infrastructure-level tasks without conflicting with TiDB’s internal state, and the actions are idempotent.

4. Kubernetes & local PVs

  • Noted an impedance mismatch between Kubernetes’ operating model and Persistent Volumes (PVs).
  • Asked why local PVs were chosen with k8s despite these challenges.
    • Answer: Local PVs provided higher performance, which network-attached storage couldn’t match.
    • Despite affinity challenges, the team prioritized performance over convenience.

Presentation feedback

  1. The questions posed to the audience felt more like a test than a way to engage them.

  2. Code snippets were hard to follow

    • Suggested simplifying code snippets to improve clarity.
    • Even with Go and Kubernetes operator experience, it was difficult to follow the arguments made in the code.

Pramod’s feedback

General feedback

  • The problem statement should be introduced much earlier.
  • Highlight the value proposition earlier to keep the audience engaged.
  • Most attendees may not be familiar with PIDB + TiDB + Kubernetes, but they will appreciate automation benefits.
  • Mention scale (if allowed) using rough figures like terabytes or petabytes.

Structural improvements

  • The first 5-6 slides were too detailed; they should be more concise.
  • Too many bullet points made the beginning feel like a paper summary rather than an experience report.
  • Move the “TiDB Query Processing” diagram earlier as it explains components and control flow well.
  • Consider using an actual query example in the flow diagram.

Content adjustments

  • Reduce the focus on TiDB details initially, as it delayed discussing the actual solution.
  • Remove or shorten the “OLAP” concept and TiFlash explanation unless necessary, as they don’t contribute to the core problem.
  • Consistently refer to “Placement Driver” instead of “PD” to avoid confusion.
  • If applicable, add a dashboard management UI diagram.
  • Simplify the TiKV slide:
    • Use 3 nodes instead of 4 for clarity.
    • Replace generic color-coded “region” labels with a real-world example (e.g., an employee record).

Runbooks & documentation

  • Clarify whether the runbook for placement driver recreation was built in-house or taken from TiDB documentation.
  • If parts were self-developed, explicitly call them out.

Presentation refinements

  • The “Key Learnings” slide should be split into two for better readability.
  • Expand abbreviations like PV, PVC, JBOD to improve accessibility.
  • Highlight custom fields in the CRD slide (e.g., pvc name, schedule time).
  • Clearly differentiate Kubernetes Jobs vs. Kubernetes Operators in the alternatives section.
  • Specify that the “Cloud” reference pertains to Flipkart’s private cloud, not public cloud providers.
  • Provide a “References” slide before the “Thank You” slide with relevant links.

Time management

  • Introduce the solution within the first 10 minutes; too much time was spent explaining TiDB concepts.
  • Have a mental cutoff point to transition into the solution.

Final suggestions

  • Trim down initial technical details and bullet points.
  • Structure slides to build engagement early.
  • Ensure clarity in terminology and diagrams.
  • Provide useful references for further reading.

Closing notes

  • The feedback mainly focused on presentation clarity rather than technical content.
  • Overall, improve flow, conciseness, and audience engagement.
