Tue, 21 Jan 2025, 06:00 PM – 07:30 PM IST
Submitted Jan 22, 2025
Review date and time - 21 January 2025, 6 PM - 7 PM
Presenter - Snehasish Roy (Software Engineer at PhonePe)
Why Clockwork was built
PhonePe, a large fintech company in India, needed a scalable and fault-tolerant job scheduler to handle asynchronous tasks like transaction reconciliation, reminders, and merchant settlements. Embedded schedulers were not suitable due to potential data loss and application logic pollution.
Clockwork’s capabilities
Clockwork is a distributed, persistent, multi-tenanted, fault-tolerant job scheduler. It can schedule actions at specific times, handle bursty traffic, support multiple clients, and ensure durability and at-least-once delivery guarantees. It is horizontally scalable and runs on commodity hardware.
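An at-least-once guarantee means a callback may reach a client more than once, for example when a retry follows a failure. The sketch below illustrates that general pattern only; it is not Clockwork's code. The names `deliver_at_least_once` and `IdempotentHandler`, the `id` field, and the retry limit are all hypothetical, and an in-memory queue stands in for the real broker.

```python
from collections import deque


def deliver_at_least_once(jobs, handler, max_attempts=3):
    """Redeliver each job until the handler succeeds or attempts run out.

    Returns the jobs that exhausted their attempts (a hypothetical dead-letter list).
    """
    pending = deque((job, 1) for job in jobs)
    dead_letter = []
    while pending:
        job, attempt = pending.popleft()
        try:
            handler(job)  # success is treated as the acknowledgement
        except Exception:
            if attempt < max_attempts:
                pending.append((job, attempt + 1))  # requeue for another try
            else:
                dead_letter.append(job)
    return dead_letter


class IdempotentHandler:
    """Wraps an action so that repeated deliveries of one job ID run it once."""

    def __init__(self, action):
        self._action = action
        self._done = set()  # assumption: a durable store would be used in practice

    def __call__(self, job):
        if job["id"] in self._done:
            return  # duplicate delivery, nothing to do
        self._action(job)
        self._done.add(job["id"])  # recorded only after the action succeeds


executed = []
failures_left = {"j2": 1}  # j2 fails once before succeeding


def action(job):
    if failures_left.get(job["id"], 0) > 0:
        failures_left[job["id"]] -= 1
        raise RuntimeError("downstream unavailable")
    executed.append(job["id"])


handler = IdempotentHandler(action)
# j1 appears twice to simulate a duplicate delivery.
dead = deliver_at_least_once([{"id": "j1"}, {"id": "j2"}, {"id": "j1"}], handler)
```

In this sketch the duplicate `j1` is skipped and `j2` succeeds on its second attempt, so each action runs exactly once even though deliveries repeat.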
Tech stack and implementation
Clockwork uses Apache Zookeeper for consensus, RabbitMQ for queuing, and Apache HBase as the database. The data schema in HBase uses a row key based on client ID, partition ID, and timestamp. A leader elector module assigns partitions to workers using Zookeeper. A job extractor queries HBase for eligible jobs and executes them using RabbitMQ for decoupling.
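A row key built from client ID, partition ID, and timestamp lets a job extractor find eligible jobs with a single range scan. The sketch below is a minimal illustration under stated assumptions: the field order, the NUL separator, the field widths, and the names `row_key`, `SortedStore`, and `due_jobs` are hypothetical, and an in-memory sorted list stands in for HBase.

```python
import bisect
import struct


def row_key(client_id: str, partition_id: int, fire_at_ms: int) -> bytes:
    """Build a byte-sortable key: client ID, partition ID, then timestamp.

    Assumes client IDs contain no NUL byte. Big-endian unsigned integers
    sort lexicographically in numeric order. A real key would likely also
    carry a job ID so that two jobs at the same millisecond do not collide.
    """
    return client_id.encode("utf-8") + b"\x00" + struct.pack(">HQ", partition_id, fire_at_ms)


class SortedStore:
    """In-memory stand-in for a table sorted by row key (not the HBase API)."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key: bytes, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start: bytes, stop: bytes):
        """Return (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def due_jobs(store: SortedStore, client_id: str, partition_id: int, now_ms: int):
    """Jobs for one client and partition whose fire time is at or before now_ms."""
    start = row_key(client_id, partition_id, 0)
    stop = row_key(client_id, partition_id, now_ms + 1)  # exclusive upper bound
    return [value for _, value in store.scan(start, stop)]


store = SortedStore()
store.put(row_key("merchant-settlements", 3, 1000), "job-a")
store.put(row_key("merchant-settlements", 3, 2000), "job-b")
store.put(row_key("merchant-settlements", 4, 500), "job-c")
store.put(row_key("reminders", 3, 900), "job-d")
eligible = due_jobs(store, "merchant-settlements", 3, 1500)
```

Because the timestamp is the last key component, all due jobs for a given client and partition sit in one contiguous key range, which is why a store with fast range operations suits this access pattern.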
Challenges and solutions
Challenges included process hot spotting, which was addressed by pre-splitting HBase regions and sharding RabbitMQ and HBase clusters. Rate limiters were implemented to maintain quality of service. Stability was achieved through benchmarking, using quorum queues in RabbitMQ, and adding metrics and events.
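The summary does not say which rate-limiting algorithm was used. As one common option for protecting quality of service in a multi-tenant system, a per-client token bucket can be sketched as follows. The class names, the rate and capacity parameters, and the injectable clock are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class PerClientLimiter:
    """Keeps one bucket per client so a bursty tenant cannot starve the others."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._buckets = {}

    def allow(self, client_id: str) -> bool:
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[client_id] = bucket
        return bucket.allow()


# A fake clock makes the behaviour deterministic for demonstration.
fake_now = [0.0]
limiter = PerClientLimiter(rate_per_sec=2, capacity=2, clock=lambda: fake_now[0])
first_burst = [limiter.allow("client-a") for _ in range(3)]
other_client = limiter.allow("client-b")
```

Here `client-a` exhausts its own bucket on the third request while `client-b` is unaffected, which is the isolation property a multi-tenant scheduler needs.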
🔗 Link to the slides shown at the review: https://docs.google.com/presentation/d/1Zxwgpcz-dYYbt3sZ44pcEuHk0erkAWqbkrr9E7Gh5xU/
Yagnik Khanna provided feedback on the presentation style, structure, and technical content.
Manner: The presentation was detailed but monotonous; Yagnik suggested more voice modulation and storytelling, focusing on the problems faced and PhonePe’s thought process in finding solutions.
Method: The setup was too long and could be compressed; details about functioning of Zookeeper and RabbitMQ were unnecessary. The focus should be on the problems and solutions, with comparisons to other technologies and why certain choices were made.
Matter: Yagnik asked about open source competitors, idempotency, debouncing, retries, and observability (metrics). He questioned the relevance of the technology given the current landscape, and asked whether it is still being maintained only for legacy reasons.
Overall feedback: Yagnik emphasized the importance of storytelling, focusing on real-life problems and learnings, and justifying the technical choices made. He also suggested being mindful of the presentation’s relevance and leaving room for audience interaction.
Srinivas Devaki summarized that the speaker should focus on creating a balance between the problem, solution, and solution challenges. The speaker should spend more time on the problems faced and the reasoning behind the chosen solutions, rather than explaining the tools themselves. The audience will be more interested in the 30% of the presentation that covers the problems and challenges, as they will likely already have some understanding of the solution.
Srinivas disagreed with the suggestion that the speaker should cut out Zookeeper and RabbitMQ completely. Srinivas suggested mentioning the tech, but focusing on the problem they solve and the challenges faced in implementing the solution. The reasoning is that the audience will be able to infer the solution if the problem is explained clearly, and that they are more interested in the problems and challenges than the tools themselves.
Srinivas pointed out that the decoupling explanation lacks clarity, questioning why it’s unacceptable for publishers but acceptable for consumers to get stuck on HTTP API calls. He also found the scalability explanation lacking, stating that it doesn’t delve into how Clockwork achieves scalability in its components and problem domains, specifically asking about partition scaling for high QPS and how acceptors handle bursty workload.
Srinivas Devaki also provided feedback on three other aspects:
Towards the end, Srinivas complimented the system design, especially the use of HBase, and mentioned it’s ideal for a scheduler database. Srinivas also praised the leader election model and decoupling, stating that these are impressive design choices for a complex distributed architecture like a scheduler. Even AWS released a multi-tenant scheduler only a couple of years ago, highlighting the difficulty of the task.
Owing to an issue at work, Harsh Mittal was unable to participate actively in the review.
Audience members participated after the reviewers gave their feedback. A summary of their questions and comments follows.
Pramod had two main questions:
Snehasish Roy responded that Zookeeper works well for their use case because it’s read-optimized and their metadata isn’t write-heavy. They only store around 2,000-3,000 keys, and the read QPS is manageable.
Regarding HBase, they self-manage it and have found it cost-effective. They did face some latency issues due to compaction, which they addressed with a custom compaction manager.
Srujan’s feedback focused on the need for stronger problem definition and justification for technical choices. Srujan emphasized that tech talks should go beyond architecture overviews, which are readily available in blog posts. Instead, the focus should be on the specific problems faced and the thought process behind the solutions chosen. Srujan also questioned the choice of RabbitMQ and HBase over other technologies, highlighting that their combination is not typically used in schedulers.
In response, Snehasish Roy explained that RabbitMQ was chosen over Kafka due to operational simplicity and the lack of a need for replayability. HBase was chosen over MariaDB due to data growth concerns and the need for fast range operations and dynamic rebalancing.