Sovereign LLM Inference on Heterogenous AI Accelerators Using llm-d and vLLM

Fri, Jun 19, 2026, 02:00 PM – 06:00 PM IST

Submitted Jun 8, 2026

Submission type: Lightning talks (10 mins)

Description

Most production inference clusters today are single-vendor, not because that
is optimal but because it is the simplest setup. Real fleets nevertheless
accumulate heterogeneity through procurement cycles, supply constraints, and
the widening cost gap between accelerators. The open question is whether a
single Kubernetes-native serving layer can take a heterogeneous accelerator
fleet and beat plain Kubernetes round-robin load balancing on throughput and
time-to-first-token (TTFT), with no application-level changes. This lightning
talk reports what we measured.
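
For readers less familiar with the two metrics, the sketch below shows one common way to derive them from per-request timestamps. It is illustrative only and is not the benchmark harness used for this talk; the `RequestTrace` type, its field names, and the `summarize` helper are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Timestamps (in seconds) and output size for one request.

    Hypothetical record layout, assumed for illustration.
    """
    sent_at: float
    first_token_at: float
    finished_at: float
    output_tokens: int


def ttft_seconds(trace: RequestTrace) -> float:
    # Time-to-first-token: delay between sending the request and
    # receiving the first generated token.
    return trace.first_token_at - trace.sent_at


def summarize(traces: list[RequestTrace]) -> dict[str, float]:
    # Aggregate throughput: total output tokens divided by the wall-clock
    # span from the first request sent to the last request finished.
    wall_clock = max(t.finished_at for t in traces) - min(t.sent_at for t in traces)
    return {
        "mean_ttft_s": mean(ttft_seconds(t) for t in traces),
        "output_tokens_per_s": sum(t.output_tokens for t in traces) / wall_clock,
    }


if __name__ == "__main__":
    demo = [
        RequestTrace(sent_at=0.0, first_token_at=0.5, finished_at=2.0, output_tokens=100),
        RequestTrace(sent_at=1.0, first_token_at=1.25, finished_at=4.0, output_tokens=200),
    ]
    print(summarize(demo))
```

Comparing two routing policies then amounts to running the same load against each and comparing the two summaries.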

We benchmarked llm-d (a CNCF inference framework built on vLLM and the
Gateway-API InferencePool) on the NxtGen sovereign cloud’s three-vendor
cluster: 4× NVIDIA H100-NVL + 8× AMD MI325X + 8× Intel Gaudi3 over a shared
100G RoCE fabric, serving ibm-granite/granite-4.1-8b and sarvamai/sarvam-30b.
Across single-vendor pools (NVIDIA-only, AMD-only, Gaudi-only) and
heterogeneous pools (NVIDIA+AMD, NVIDIA+AMD+Gaudi), llm-d’s
prefix-cache-aware routing delivers 25% to 91% higher throughput and 5–22×
lower TTFT than plain Kubernetes round-robin, and the advantage grows with
pool size and heterogeneity. The biggest win is on the 20-pod, three-vendor
pool, where llm-d reaches +91% throughput at the same load.
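
To give an intuition for why prefix-cache-aware routing helps, here is a toy simulation that compares round-robin with a simplified cache-aware policy on prefix-cache hit rate. It is a sketch under strong assumptions and does not reproduce llm-d's actual scorer: the `prefix_key` and `simulate` helpers are hypothetical, a "prefix" is just the first 32 characters of the prompt, each pod's cache never evicts, and load is a plain request count.

```python
from itertools import cycle


def prefix_key(prompt: str, prefix_chars: int = 32) -> str:
    # Stand-in for a KV-cache block hash: the leading characters of the prompt.
    return prompt[:prefix_chars]


def simulate(prompts: list[str], num_pods: int, policy: str) -> float:
    """Return the fraction of requests that land on a pod already holding their prefix."""
    caches = [set() for _ in range(num_pods)]  # prefixes each pod has seen (no eviction)
    load = [0] * num_pods                      # requests routed to each pod so far
    round_robin = cycle(range(num_pods))
    hits = 0
    for prompt in prompts:
        key = prefix_key(prompt)
        if policy == "round_robin":
            pod = next(round_robin)
        elif policy == "prefix_aware":
            # Prefer pods that already cache this prefix; otherwise consider all pods.
            warm = [i for i in range(num_pods) if key in caches[i]]
            candidates = warm or list(range(num_pods))
            # Break ties by picking the least-loaded candidate.
            pod = min(candidates, key=load.__getitem__)
        else:
            raise ValueError(f"unknown policy: {policy}")
        if key in caches[pod]:
            hits += 1
        caches[pod].add(key)
        load[pod] += 1
    return hits / len(prompts)


if __name__ == "__main__":
    # Seven shared system prompts, each followed by a different user question.
    system_prompts = [f"[system prompt {i:02d}] ".ljust(32, ".") for i in range(7)]
    prompts = [system_prompts[n % 7] + f"question {n}" for n in range(200)]
    for policy in ("round_robin", "prefix_aware"):
        print(policy, simulate(prompts, num_pods=20, policy=policy))
```

In this toy version a prefix stays pinned to the first pod that served it, which would overload that pod under skewed traffic. A production router has to weigh cache affinity against queue depth and per-accelerator capacity, which matters even more when the pods sit on different vendors' hardware.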

Artifacts are available at https://github.com/praveingk/llmd-benchmarking-nxtgen

Take-aways:

- Heterogeneous accelerator fleets stop being a tax once routing is cache-
  and load-aware. A single Kubernetes serving layer can absorb NVIDIA + AMD +
  Intel concurrently and beat round-robin by close to 2× on throughput with
  the same pods, same vLLM, and same flags. Only the routing layer differs,
  so the win is attributable to llm-d’s prefix-cache-aware router.
- Sovereign and on-prem inference is now operationally viable on
  mixed-vendor hardware. Procurement no longer has to align with a single
  vendor’s roadmap to get good aggregate throughput; older accelerators can
  absorb low-priority workloads while premium hardware handles
  latency-sensitive paths.

Audience:

- Platform and SRE teams running on-prem / sovereign / hybrid-cloud LLM
  inference
- ML infrastructure engineers evaluating Kubernetes-native serving stacks
  (vLLM, llm-d, KServe)
- Teams considering or already running heterogeneous GPU fleets
  (NVIDIA + AMD + Intel) and worried about how to schedule across them
- Sovereign-cloud and regulated-industry teams (BFSI, government, healthcare)
  who need on-prem inference and cannot rely on hyperscaler-only stacks
- Open-source contributors interested in the llm-d / vLLM / Gateway-API
  Inference Extension projects

Bio:

Pravein Govindan Kannan is a Staff Research Scientist at IBM Research working on systems and networking for AI inference. He contributes to open-source projects such as llm-d, UCCL, and NIXL.

Enterprise AI in Production
