Serving a model that doesn't fit on one GPU: the communication tax of distributed inference

28 Mar 2026 (Sat), 11:00 AM – 01:00 PM IST

Submitted Jun 7, 2026

Abstract

When a model’s weights and KV cache exceed the memory of a single GPU, inference serving becomes a distributed-systems problem. The model state must be partitioned across devices, and once it is, every forward pass requires data movement across an interconnect. As a result, the primary design constraint is often communication cost rather than raw compute.
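As a rough illustration of when that single-GPU limit is crossed, the sketch below estimates the weight and KV-cache footprint of a decoder-only transformer. The function names, the `usable_fraction` headroom, and every number in the example are hypothetical and do not describe any particular model or GPU.

```python
def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone (2 bytes per parameter assumes fp16/bf16)."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: one K and one V tensor per layer, per token, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def fits_on_one_gpu(total_bytes: float, gpu_mem_bytes: float,
                    usable_fraction: float = 0.9) -> bool:
    """Assumes only a fraction of device memory is usable for model state."""
    return total_bytes <= gpu_mem_bytes * usable_fraction


# Hypothetical configuration: 70e9 parameters, 80 layers, 8 KV heads of
# dimension 128, 16 concurrent sequences of 8192 tokens, 2-byte elements.
weights = weight_bytes(70e9)
kv = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                    seq_len=8192, batch=16)
needs_partitioning = not fits_on_one_gpu(weights + kv, gpu_mem_bytes=80e9)
```

With these assumed numbers the weights alone exceed an 80 GB device, and the KV cache adds tens of gigabytes more, so the model state has to be partitioned.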

This talk examines LLM inference serving as a distributed system characterised by a high communication-to-compute ratio. It covers the three principal strategies for partitioning a model across GPUs and the communication cost associated with each:

Tensor parallelism, which splits individual matrix multiplications across devices and requires an all-reduce operation at multiple points in every layer. Its synchronous, bandwidth-intensive nature generally confines it to a single high-bandwidth domain such as NVLink.
Pipeline parallelism, which divides the model into sequential stages and exchanges activations through point-to-point communication. It scales across nodes more economically but introduces pipeline idle time and additional per-request latency.
Expert parallelism, used for Mixture-of-Experts models, which distributes experts across devices and routes tokens to them. Here, the dominant costs are all-to-all communication and load imbalance across experts.
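The three communication patterns above can be compared with a back-of-envelope byte count. The sketch below is a simplified model, not a measurement. It assumes a ring all-reduce (each device sends about 2(N-1)/N times the message size), two all-reduces per layer for tensor parallelism, and uniform token routing for expert parallelism. All function and parameter names are hypothetical.

```python
def ring_allreduce_bytes_per_device(msg_bytes: float, n: int) -> float:
    """Bytes each device sends in a ring all-reduce of a msg_bytes buffer."""
    if n == 1:
        return 0.0
    return 2 * (n - 1) / n * msg_bytes


def tp_bytes_per_layer(tokens: int, hidden: int, n: int,
                       bytes_per_elem: int = 2,
                       allreduces_per_layer: int = 2) -> float:
    """Tensor parallelism: assumed two activation all-reduces per layer."""
    msg = tokens * hidden * bytes_per_elem
    return allreduces_per_layer * ring_allreduce_bytes_per_device(msg, n)


def pp_bytes_per_boundary(tokens: int, hidden: int,
                          bytes_per_elem: int = 2) -> int:
    """Pipeline parallelism: one point-to-point activation send per stage boundary."""
    return tokens * hidden * bytes_per_elem


def ep_bytes_per_moe_layer(tokens: int, hidden: int, n: int, top_k: int,
                           bytes_per_elem: int = 2) -> float:
    """Expert parallelism: dispatch plus combine all-to-all, assuming uniform
    routing so that a fraction (n - 1) / n of token copies leave the device."""
    if n == 1:
        return 0.0
    return 2 * tokens * top_k * hidden * bytes_per_elem * (n - 1) / n


# One decode step for a single token, hidden size 8192, 8 devices (all assumed).
tp = tp_bytes_per_layer(tokens=1, hidden=8192, n=8)
pp = pp_bytes_per_boundary(tokens=1, hidden=8192)
ep = ep_bytes_per_moe_layer(tokens=1, hidden=8192, n=8, top_k=2)
```

The byte counts are small for single-token decoding; what differs is how often each pattern occurs. Tensor parallelism pays its cost in every layer, pipeline parallelism only at stage boundaries, and expert parallelism only in MoE layers, where its real cost also depends on how evenly tokens are routed.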

Key takeaways

Two distinct reasons to use multiple GPUs. Fitting a model that is too large for one device (partitioning) is a different problem from serving higher request volume (replication), and each calls for a different approach.
Each parallelism strategy is best understood by its communication pattern. Tensor parallelism relies on frequent synchronous all-reduce operations and is bandwidth-bound; pipeline parallelism uses inexpensive point-to-point transfers but incurs pipeline idle time; expert parallelism depends on all-to-all communication and is sensitive to load imbalance.
The interconnect determines the topology. The available bandwidth between devices (NVLink, PCIe, InfiniBand, Ethernet) constrains where tensor parallelism can be applied and where pipeline or expert parallelism becomes necessary.
Inference is harder to partition than training. Decoding generates one token per step, so each step has little compute over which to amortise collective-communication overhead. That overhead becomes a proportionally large share of step time, which matters in a latency-sensitive serving context.
Scaling has limits. Beyond a certain point, adding GPUs increases communication cost faster than it reduces per-device compute, producing diminishing and eventually negative returns. Identifying that point is part of capacity planning.
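The scaling limit in the last takeaway can be illustrated with a toy latency model. The sketch below assumes per-step compute divides evenly across devices and that each collective follows a simple ring all-reduce cost of 2(N-1) latency hops plus a bandwidth term. The function names and every constant are hypothetical, chosen only to show the shape of the curve.

```python
def step_time_s(n: int, compute_s: float, msg_bytes: float, alpha_s: float,
                bw_bytes_per_s: float, collectives: int) -> float:
    """Estimated decode step time on n devices under a toy alpha-beta model."""
    compute = compute_s / n  # assumes perfect compute scaling
    if n == 1:
        return compute
    per_collective = (2 * (n - 1) * alpha_s
                      + 2 * (n - 1) / n * msg_bytes / bw_bytes_per_s)
    return compute + collectives * per_collective


def best_device_count(candidates, **model) -> int:
    """Return the candidate device count with the lowest estimated step time."""
    return min(candidates, key=lambda n: step_time_s(n, **model))


# Hypothetical inputs: 20 ms single-device step, 16 KiB per all-reduce,
# 5 microseconds per hop, 100 GB/s links, 160 collectives per step.
model = dict(compute_s=20e-3, msg_bytes=16384, alpha_s=5e-6,
             bw_bytes_per_s=100e9, collectives=160)
candidates = [1, 2, 4, 8, 16]
best = best_device_count(candidates, **model)
times = {n: step_time_s(n, **model) for n in candidates}
```

Under these assumed constants the estimated step time falls up to 4 devices, rises again at 8, and at 16 devices is worse than on a single device, because the per-hop latency term grows with N while the compute saving shrinks.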

Speak at Bengaluru Systems meet-up
