Running A Highly Available RabbitMQ Cluster

Fri 21 Jun 2019, 08:45 AM – 05:40 PM IST

Sat 22 Jun 2019, 09:00 AM – 05:30 PM IST

NIMHANS Convention Centre, Bangalore

Submitted Mar 9, 2019

Technical level: Intermediate

At Zapier, we connect over 1000 SaaS applications and enable people to automate workflows that span multiple web applications. To achieve that, we use RabbitMQ to run millions of tasks every day. It is effectively the backbone of Zapier.

We ran RabbitMQ in clustering mode at Zapier for scalability. We soon realised that RabbitMQ clustering is designed for scalability and not for high availability. If a node in the cluster failed, the queues on that node were lost, and the failure also took the other nodes out of service. RabbitMQ does have a mirroring feature that replicates queues across multiple nodes, but it does not distribute load across those nodes, since consumers connect only to the master. During a failover, there is also a chance that previously unacknowledged messages will be redelivered.

In this talk, we will dive into how we architected an alternative clustering solution that treated each RabbitMQ node as a stand-alone node, thereby tolerating node failures without disrupting the other nodes.
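The abstract does not describe the chosen design, so the following is only a minimal sketch of the general idea of treating nodes as independent: a publisher rotates across brokers that share no state and skips any node that is down. The names StandaloneNodePublisher, InMemoryNode and NodeDown are hypothetical, and the in-memory node stands in for a real broker connection; none of this is taken from Zapier's implementation.

```python
import itertools
from typing import Callable, Dict, List


class NodeDown(Exception):
    """Raised by a (hypothetical) node transport when the node is unreachable."""


class InMemoryNode:
    """Stand-in for one stand-alone broker; a real one would be a network client."""

    def __init__(self) -> None:
        self.up = True
        self.queue: List[bytes] = []

    def send(self, body: bytes) -> None:
        if not self.up:
            raise NodeDown("node is down")
        self.queue.append(body)


class StandaloneNodePublisher:
    """Publishes each message to exactly one of several independent nodes.

    Nodes share no state, so losing one node does not affect the others;
    the publisher simply tries the next node in rotation.
    """

    def __init__(self, senders: Dict[str, Callable[[bytes], None]]) -> None:
        if not senders:
            raise ValueError("at least one node is required")
        self._senders = senders
        self._names = list(senders)
        # Rotating start index spreads load across nodes (round-robin).
        self._cursor = itertools.cycle(range(len(self._names)))

    def publish(self, body: bytes) -> str:
        """Send body to the first reachable node and return that node's name."""
        start = next(self._cursor)
        for offset in range(len(self._names)):
            name = self._names[(start + offset) % len(self._names)]
            try:
                self._senders[name](body)
                return name
            except NodeDown:
                continue  # this node is out; try the next one
        raise NodeDown("all nodes are unreachable")
```

In this sketch a failed node only removes itself from the rotation. Messages already sitting on the failed node remain unavailable until it recovers, which is a trade-off any design of this kind has to address separately.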

Outline

Setting the scene: Current scale at Zapier and how RabbitMQ is a crucial piece of our architecture
Understand the shortcomings of native RabbitMQ clustering with a simple example
Vision: A highly available, durable, scalable RabbitMQ cluster
Designs considered
Implementation details of chosen design
Demo
The Future

Requirements

Basic understanding of message queues

Speaker bio

Kishore works as a Site Reliability Engineer at Zapier. He loves working on distributed systems and gets a kick out of designing for high availability and scale.

Rootconf 2019