Fri, 19 Jun 2026, 02:00 PM – 06:00 PM IST
Pravein Govindan Kannan
@praveingk
Submitted Jun 8, 2026
Description
Most production inference clusters today are single-vendor, not because that is
optimal, but because it is the simplest way to set things up. Real fleets,
however, accumulate heterogeneity through procurement cycles, supply
constraints, and the widening cost gap between accelerators. The open question is whether a
single Kubernetes-native serving layer can take a heterogeneous GPU fleet and
beat plain k8s round-robin on throughput and time-to-first-token, with no
application-level changes. This lightning talk reports what we measured.
We benchmarked llm-d (a CNCF inference framework built on vLLM and the
Gateway-API InferencePool) on the NxtGen sovereign cloud’s three-vendor cluster:
4× NVIDIA H100-NVL + 8× AMD MI325X + 8× Intel Gaudi3 over a shared 100G RoCE
fabric, serving ibm-granite/granite-4.1-8b and sarvamai/sarvam-30b. Across
single-vendor pools (NVIDIA-only, AMD-only, Gaudi-only) and heterogeneous pools
(NVIDIA+AMD, NVIDIA+AMD+Gaudi), llm-d’s prefix-cache-aware routing
delivers 25% to 91% higher throughput and 5–22× lower TTFT than plain
Kubernetes round-robin, and the advantage grows with pool size and
heterogeneity. The biggest win is on the 20-pod three-vendor pool, where llm-d
reaches 91% higher throughput at the same load.
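To give some intuition for why prefix-cache-aware routing can beat round-robin, here is a minimal toy sketch in Python. It is not llm-d's scheduler and not the benchmark code: the `simulate` function, the unbounded per-pod prefix sets, and the synthetic workload of five repeating prompt prefixes are all assumptions made for illustration. It also ignores accelerator heterogeneity, cache eviction, and queueing, so it only shows the cache-hit effect, not the measured throughput or TTFT numbers.

```python
from itertools import cycle


def simulate(requests, num_pods, prefix_aware):
    """Count requests whose prompt prefix is already cached on the chosen pod.

    Toy model (assumption): each pod keeps every prefix it has ever served,
    and a request is fully described by its prefix string.
    """
    caches = [set() for _ in range(num_pods)]  # prefixes cached per pod
    load = [0] * num_pods                      # requests served per pod
    rotation = cycle(range(num_pods))          # round-robin order
    hits = 0

    for prefix in requests:
        if prefix_aware:
            # Prefer a pod that already holds the prefix; break ties by load.
            pod = min(
                range(num_pods),
                key=lambda p: (prefix not in caches[p], load[p]),
            )
        else:
            # Plain round-robin ignores request content entirely.
            pod = next(rotation)

        if prefix in caches[pod]:
            hits += 1
        caches[pod].add(prefix)
        load[pod] += 1

    return hits


# Hypothetical workload: 40 requests cycling through 5 shared prefixes.
requests = [f"system-prompt-{i % 5}" for i in range(40)]

rr_hits = simulate(requests, num_pods=4, prefix_aware=False)
aware_hits = simulate(requests, num_pods=4, prefix_aware=True)
print(f"round-robin cache hits:  {rr_hits}/{len(requests)}")
print(f"prefix-aware cache hits: {aware_hits}/{len(requests)}")
```

In this toy setup, round-robin spreads each prefix across every pod before any reuse happens, while the prefix-aware policy sends repeat prefixes back to the pod that already holds them. A real router also has to weigh load, memory pressure, and per-accelerator speed, which this sketch leaves out.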
Artifacts are available at https://github.com/praveingk/llmd-benchmarking-nxtgen
Take-aways:
Audience:
Bio:
Pravein Govindan Kannan is a Staff Research Scientist at IBM Research working on Systems and Networking for AI Inference. He contributes to open-source projects like llm-d, UCCL and NIXL.