Speak at The Fifth Elephant 2026 Annual Conference
Share your work with the community
Fri, Jul 31, 2026, 09:00 AM – 06:00 PM IST
Submitted Jun 24, 2026
llm-d (llm-d.ai) is a high-performance distributed inference serving stack optimized for production deployments on Kubernetes. It provides a transparent routing layer that sits between the client and the vLLM serving pods and makes key scheduling decisions at request granularity.
Most production inference clusters today are single-vendor — not because it is optimal, but because it is the simplest way to set things up. Real fleets are accumulating heterogeneity through procurement cycles, supply constraints, and the widening cost gap between accelerators. The open question is whether a single Kubernetes-native serving layer can take a heterogeneous GPU fleet and beat plain k8s round-robin on throughput and time-to-first-token, with no application-level changes.
We benchmarked llm-d on the NxtGen sovereign cloud’s 3-vendor cluster: 4× NVIDIA H100-NVL + 8× AMD MI325X + 8× Intel Gaudi3 over a shared 100G RoCE fabric, serving ibm-granite/granite-4.1-8b and sarvamai/sarvam-30b. Across single-vendor pools (NVIDIA-only, AMD-only, Gaudi-only) and heterogeneous pools (NVIDIA+AMD, NVIDIA+AMD+Gaudi), llm-d’s prefix-cache-aware routing delivers +25% to +91% throughput and 5–22× lower TTFT than plain Kubernetes round-robin, and the advantage grows with pool size and heterogeneity. The biggest win is on the 20-pod 3-vendor pool, where llm-d reaches +91% throughput at the same load.
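To illustrate why prefix-cache-aware routing can beat round-robin, here is a minimal toy simulation. It is a sketch under stated assumptions, not llm-d's actual router: the function names, the set-based cache model, and the synthetic trace are all hypothetical, and the numbers it prints have nothing to do with the benchmark results above. It only counts what fraction of prompt tokens land on a pod that has already seen the same prefix.

```python
from itertools import cycle

def longest_cached_prefix(cache, tokens):
    """Length of the longest prefix of `tokens` present in `cache` (a set of tuples)."""
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in cache:
            return n
    return 0

def remember(cache, tokens):
    """Record every prefix of `tokens` as cached on this pod (toy model, no eviction)."""
    for n in range(1, len(tokens) + 1):
        cache.add(tuple(tokens[:n]))

def simulate(requests, num_pods, policy):
    """Return the fraction of prompt tokens served from a pod-local prefix cache."""
    caches = [set() for _ in range(num_pods)]
    load = [0] * num_pods            # requests assigned so far, used as a tie-breaker
    rr = cycle(range(num_pods))
    reused = total = 0
    for tokens in requests:
        if policy == "round_robin":
            pod = next(rr)
        else:
            # Hypothetical scoring: longest cached prefix first, then least-loaded pod.
            pod = max(
                range(num_pods),
                key=lambda i: (longest_cached_prefix(caches[i], tokens), -load[i]),
            )
        reused += longest_cached_prefix(caches[pod], tokens)
        total += len(tokens)
        remember(caches[pod], tokens)
        load[pod] += 1
    return reused / total

def make_requests(num_prompts=4, prompt_len=8, count=24):
    """Synthetic trace: a few shared system prompts, each followed by a unique suffix."""
    requests = []
    for k in range(count):
        p = k % num_prompts
        shared = [f"p{p}t{j}" for j in range(prompt_len)]
        requests.append(shared + [f"u{k}a", f"u{k}b"])
    return requests

if __name__ == "__main__":
    trace = make_requests()
    for policy in ("round_robin", "prefix_aware"):
        print(policy, round(simulate(trace, num_pods=3, policy=policy), 3))
```

On this synthetic trace, round-robin spreads each shared prompt across all three pods, so every pod pays the cold-prefill cost once per prompt. The prefix-aware policy pins each prompt to the pod that already holds it, which is the intuition behind the throughput and TTFT gap; a production router would also have to account for cache eviction, queue depth, and per-accelerator speed.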
Resources: https://llm-d.ai/blog/heterogeneous-inference-3-vendor-sovereign-cluster
Heterogeneous GPU fleets stop being a tax once routing is cache- and load-aware. A single Kubernetes serving layer can absorb NVIDIA + AMD + Intel concurrently and beat round-robin by close to 2× on throughput, with the same pods, same vLLM, same flags — only the routing layer differs. The win is unambiguously attributable to llm-d’s prefix-cache-aware router. Sovereign and on-prem inference is now operationally viable on mixed-vendor hardware. Procurement no longer has to align with a single vendor’s roadmap to get good aggregate throughput; older accelerators can absorb low-priority workloads while premium hardware handles latency-sensitive paths.
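The idea of sending latency-sensitive traffic to premium hardware while older accelerators absorb low-priority work can be sketched as a simple tiered, least-loaded selection. This is an illustrative assumption, not how llm-d is configured: the `Pod` class, the `tier` labels, and `pick_pod` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    tier: str            # hypothetical label: "premium" or "older"
    in_flight: int = 0   # requests currently assigned to this pod

def pick_pod(pods, latency_sensitive):
    """Pick the least-loaded pod in the preferred tier, falling back to any pod."""
    wanted = "premium" if latency_sensitive else "older"
    candidates = [p for p in pods if p.tier == wanted] or list(pods)
    chosen = min(candidates, key=lambda p: p.in_flight)
    chosen.in_flight += 1
    return chosen

if __name__ == "__main__":
    fleet = [
        Pod("premium-0", "premium"),
        Pod("premium-1", "premium"),
        Pod("older-0", "older"),
    ]
    print(pick_pod(fleet, latency_sensitive=True).name)
    print(pick_pod(fleet, latency_sensitive=False).name)
```

The fallback to the whole fleet keeps requests flowing when a tier has no pods; a real deployment would combine this kind of tiering with the cache- and load-aware scoring described above rather than use it alone.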
Platform and SRE teams running on-prem / sovereign / hybrid-cloud LLM inference; ML infrastructure engineers evaluating Kubernetes-native serving stacks (vLLM, llm-d, KServe); teams considering or already running heterogeneous GPU fleets (NVIDIA + AMD + Intel) and worried about how to schedule across them; sovereign-cloud and regulated-industry teams (BFSI, government, healthcare) who need on-prem inference and cannot rely on hyperscaler-only stacks; open-source contributors interested in the llm-d / vLLM / Gateway-API Inference Extension projects.
Pravein Govindan Kannan is a Staff Research Scientist at IBM Research working on Systems and Networking for AI Inference. He contributes to open-source projects like llm-d, UCCL and NIXL.
Jayanth Babu Reddy is a Principal Architect at NxtGen Cloud Technologies.