Running A Highly Available RabbitMQ Cluster
At Zapier, we connect over 1000 SaaS applications and enable people to automate workflows that span multiple web applications. To achieve that, we use RabbitMQ to run millions of tasks every day. It is, in effect, the backbone of Zapier.
We ran RabbitMQ in clustering mode at Zapier for scalability. We soon realised that RabbitMQ clustering is designed for scalability and not for high availability. If a node in the cluster failed, the queues on that node were lost, and the failure also took the other nodes out of service. Although RabbitMQ has a mirroring feature that replicates queues across multiple nodes, it does not distribute load across these nodes, since consumers connect only to the master. During a failover, there’s also a chance that previously unacknowledged messages will get redelivered.
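Redelivery after a failover means a consumer may see the same message twice. One common way to cope is to make the consumer idempotent. The sketch below is illustrative only and is not taken from Zapier's setup: it assumes each message carries a unique `message_id` set by the publisher, and the names `make_idempotent_handler` and `process` are hypothetical. The in-memory set stands in for whatever durable store a real system would use.

```python
from typing import Callable, Set


def make_idempotent_handler(
    process: Callable[[bytes], None],
) -> Callable[[str, bytes], bool]:
    """Wrap `process` so a redelivered message is handled at most once.

    Assumption: every message has a unique ID assigned by the publisher.
    """
    seen: Set[str] = set()  # illustrative; a real consumer needs a durable store

    def handle(message_id: str, body: bytes) -> bool:
        if message_id in seen:
            # Already processed before the failover: skip the work.
            return False
        process(body)
        # Record the ID only after processing succeeds, so a crash
        # mid-processing still lets the redelivered copy be handled.
        seen.add(message_id)
        return True

    return handle
```

The handler returns `True` when it did the work and `False` when it recognised a duplicate; in both cases the consumer can safely acknowledge the message.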
In this talk, we will dive into how we architected an alternative clustering solution that treated each RabbitMQ node as a stand-alone node, thereby tolerating node failures without disrupting the other nodes.
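The abstract does not spell out the chosen design, so the following is only a minimal sketch of the general idea of treating nodes as independent: the publisher spreads messages across stand-alone nodes and skips any node that is down. `StandaloneNodePublisher`, `NodeDown`, and the per-node send callables are hypothetical names; in practice each callable would wrap a real client connection to one RabbitMQ node.

```python
from typing import Callable, Dict


class NodeDown(Exception):
    """Raised by a send callable when its node is unreachable (hypothetical)."""


class StandaloneNodePublisher:
    """Round-robin publisher over independent (non-clustered) nodes."""

    def __init__(self, senders: Dict[str, Callable[[bytes], None]]) -> None:
        self._senders = senders
        self._order = list(senders)  # fixed rotation order of node names
        self._next = 0               # index of the node to try first

    def publish(self, body: bytes) -> str:
        """Send `body` to the next healthy node and return that node's name."""
        count = len(self._order)
        for attempt in range(count):
            name = self._order[(self._next + attempt) % count]
            try:
                self._senders[name](body)
            except NodeDown:
                continue  # this node failed; the others are unaffected
            # Start the next publish from the node after the one that worked.
            self._next = (self._next + attempt + 1) % count
            return name
        raise NodeDown("no RabbitMQ node accepted the message")
```

Because no node depends on another, losing one node only removes its share of capacity; the remaining nodes keep accepting messages.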
- Setting up the scene: Current scale at Zapier and how Rabbit is a crucial piece of our architecture
- Understand the shortcomings of native RabbitMQ clustering with a simple example
- Vision: A highly available, durable, scalable RabbitMQ cluster
- Designs considered
- Implementation details of chosen design
- The Future
Basic understanding of message queues
Kishore works as a Site Reliability Engineer at Zapier. He loves working on distributed systems and gets a kick out of designing for high availability and scale.