Abstract
When a model’s weights and KV cache exceed the memory of a single GPU, inference serving becomes a distributed-systems problem. The model state must be partitioned across devices, and once it is, every forward pass requires data movement across an interconnect. As a result, the primary design constraint is often communication cost rather than raw compute.
This talk examines LLM inference serving as a distributed system characterised by a high communication-to-compute ratio. It covers the three principal strategies for partitioning a model across GPUs and the communication cost associated with each:
- Tensor parallelism, which splits individual matrix multiplications across devices and requires an all-reduce operation at multiple points in every layer. Its synchronous, bandwidth-intensive nature generally confines it to a single high-bandwidth domain such as NVLink.
- Pipeline parallelism, which divides the model into sequential stages and exchanges activations through point-to-point communication. It scales across nodes more economically but introduces pipeline idle time and additional per-request latency.
- Expert parallelism, used for Mixture-of-Experts models, which distributes experts across devices and routes tokens to them. Here, the dominant costs are all-to-all communication and load imbalance across experts.
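The gap between the first two communication patterns can be made concrete with a rough estimate. The sketch below is illustrative only and is not taken from the talk: it assumes a ring all-reduce, two all-reduce operations per layer for tensor parallelism, 2-byte activations, and a single decoded token. The function names and the model dimensions are hypothetical.

```python
def ring_all_reduce_bytes(message_bytes: float, n: int) -> float:
    """Bytes each device sends in a ring all-reduce of `message_bytes` over n devices."""
    if n <= 1:
        return 0.0
    # A ring all-reduce is a reduce-scatter followed by an all-gather;
    # each phase sends (n - 1) / n of the message from every device.
    return 2.0 * (n - 1) / n * message_bytes


def tp_bytes_per_token(hidden: int, layers: int, tp: int,
                       bytes_per_elem: int = 2,
                       all_reduces_per_layer: int = 2) -> float:
    """Per-device bytes sent to decode one token under tensor parallelism.

    Assumption: each all-reduce carries one hidden-size activation vector.
    """
    activation = hidden * bytes_per_elem
    return layers * all_reduces_per_layer * ring_all_reduce_bytes(activation, tp)


def pp_bytes_per_token(hidden: int, stages: int, bytes_per_elem: int = 2) -> float:
    """Bytes one non-final pipeline stage sends to the next stage for one token."""
    if stages <= 1:
        return 0.0
    return float(hidden * bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical model shape: hidden size 8192, 80 layers, 8 devices.
    tp = tp_bytes_per_token(hidden=8192, layers=80, tp=8)
    pp = pp_bytes_per_token(hidden=8192, stages=8)
    print(f"tensor parallel:   {tp:,.0f} bytes per device per token")
    print(f"pipeline parallel: {pp:,.0f} bytes per stage per token")
```

Under these assumptions, tensor parallelism moves a few hundred times more data per device per token than a pipeline stage boundary does, and every one of those transfers sits on the critical path of the forward pass. That is the reason the approach is usually kept inside a high-bandwidth domain.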
Key takeaways
- Two distinct reasons to use multiple GPUs. Fitting a model that is too large for one device (partitioning) is a different problem from serving higher request volume (replication), and each calls for a different approach.
- Each parallelism strategy is best understood by its communication pattern. Tensor parallelism relies on frequent synchronous all-reduce operations and is bandwidth-bound; pipeline parallelism uses inexpensive point-to-point transfers but incurs pipeline idle time; expert parallelism depends on all-to-all communication and is sensitive to load imbalance.
- The interconnect determines the topology. The available bandwidth between devices (NVLink, PCIe, InfiniBand, Ethernet) constrains where tensor parallelism can be applied and where pipeline or expert parallelism becomes necessary.
- Inference is harder to partition than training. Single-token decoding performs little compute per step, so the fixed overhead of each collective operation becomes a proportionally large share of step time, which matters in a latency-sensitive serving context.
- Scaling has limits. Beyond a certain point, adding GPUs increases communication cost faster than it reduces per-device compute, producing diminishing and eventually negative returns. Identifying that point is part of capacity planning.
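The scaling limit in the last takeaway can be illustrated with a toy cost model. The sketch below is an assumption-laden illustration, not the talk's method: it uses the common latency-plus-bandwidth ("alpha-beta") approximation for a ring all-reduce, assumes compute divides perfectly across devices, and uses made-up parameter values. The names `step_time` and `best_device_count` are hypothetical.

```python
def step_time(n: int, compute_s: float, collectives: int, msg_bytes: float,
              alpha_s: float, bandwidth_bytes_per_s: float) -> float:
    """Estimated time of one decode step on n devices.

    compute_s:   single-device compute time for the step (assumed to divide by n)
    collectives: number of all-reduce operations per step
    alpha_s:     fixed latency per ring hop, in seconds
    """
    if n <= 1:
        return compute_s
    # Ring all-reduce: 2 * (n - 1) hops, each paying the fixed latency,
    # plus 2 * (n - 1) / n of the message over the link bandwidth.
    per_collective = (2 * (n - 1) * alpha_s
                      + 2 * (n - 1) / n * msg_bytes / bandwidth_bytes_per_s)
    return compute_s / n + collectives * per_collective


def best_device_count(candidates, **params) -> int:
    """Return the candidate device count with the lowest estimated step time."""
    return min(candidates, key=lambda n: step_time(n, **params))


if __name__ == "__main__":
    # Hypothetical values: 20 ms of compute, 160 all-reduces per step,
    # 16 KiB messages, 5 microseconds per hop, 100 GB/s links.
    params = dict(compute_s=0.020, collectives=160, msg_bytes=16384,
                  alpha_s=5e-6, bandwidth_bytes_per_s=100e9)
    for n in (1, 2, 4, 8, 16):
        print(f"{n:>2} devices: {step_time(n, **params) * 1e3:.2f} ms")
    print("best:", best_device_count((1, 2, 4, 8, 16), **params))
```

With these illustrative numbers the estimated step time falls up to a small device count and then rises again, because the per-hop latency term grows with the number of devices while the compute term shrinks only as 1/n. Real systems differ in the constants and in the collective algorithm used, but the shape of the curve is what capacity planning has to locate.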