Pravein Govindan Kannan

@praveingk

Sovereign LLM Inference on Heterogeneous AI Accelerators Using llm-d and vLLM

Submitted Jun 8, 2026

Description

Most production inference clusters today are single-vendor — not because it is
optimal, but because it is the simplest way to set things up. Real fleets are
accumulating heterogeneity, through procurement cycles, supply constraints,
and the widening cost gap between accelerators. The open question is whether a
single Kubernetes-native serving layer can take a heterogeneous accelerator
fleet and beat plain Kubernetes round-robin on throughput and
time-to-first-token (TTFT), with no application-level changes. This lightning
talk reports what we measured.

We benchmarked llm-d (a CNCF inference framework built on vLLM and the
Gateway-API InferencePool) on the NxtGen sovereign cloud’s 3-vendor cluster:
4× NVIDIA H100-NVL + 8× AMD MI325X + 8× Intel Gaudi3 over a shared 100 G RoCE
fabric, serving ibm-granite/granite-4.1-8b and sarvamai/sarvam-30b. Across
single-vendor pools (NVIDIA-only, AMD-only, Gaudi-only) and heterogeneous pools
(NVIDIA+AMD, NVIDIA+AMD+Gaudi), llm-d’s prefix-cache-aware routing
delivers 25% to 91% higher throughput and 5–22× lower TTFT than plain
Kubernetes round-robin, and the advantage grows with pool size and
heterogeneity. The biggest win is on the 20-pod 3-vendor pool, where llm-d
reaches 91% higher throughput at the same load.
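
To make the routing difference concrete, here is a minimal toy sketch in Python of why a prefix-cache-aware picker reuses more cached work than round-robin. It is not llm-d's scorer or vLLM's cache: the names `Pod`, `prefix_blocks`, `round_robin_router`, `prefix_aware_router`, and `simulate`, the 4-token block size, and the synthetic three-tenant workload are all assumptions made for illustration, and the counts it produces are not benchmark results.

```python
from itertools import cycle

BLOCK = 4  # tokens per cache block; illustrative only, real engines use larger blocks


def prefix_blocks(tokens, block=BLOCK):
    """Return one hashable cumulative prefix per full block of tokens."""
    return [tuple(tokens[: i + block]) for i in range(0, len(tokens) - block + 1, block)]


class Pod:
    """Hypothetical stand-in for one serving replica with its own prefix cache."""

    def __init__(self, name):
        self.name = name
        self.cached = set()  # prefixes this pod has already computed
        self.served = 0      # crude load signal: requests handled so far

    def match_len(self, blocks):
        """Number of leading blocks already present in this pod's cache."""
        n = 0
        for b in blocks:
            if b not in self.cached:
                break
            n += 1
        return n

    def serve(self, blocks):
        """Serve a request, return how many blocks were cache hits."""
        hit = self.match_len(blocks)
        self.cached.update(blocks)
        self.served += 1
        return hit


def round_robin_router(pods):
    """Cache-oblivious baseline: rotate through pods in order."""
    order = cycle(pods)
    return lambda blocks: next(order)


def prefix_aware_router(pods):
    """Prefer the pod with the longest cached prefix; break ties by lowest load."""
    return lambda blocks: max(pods, key=lambda p: (p.match_len(blocks), -p.served))


def simulate(make_router, n_pods=4, n_tenants=3, n_requests=24):
    """Replay a synthetic workload; return (blocks reused, blocks requested)."""
    pods = [Pod(f"pod-{i}") for i in range(n_pods)]
    route = make_router(pods)
    hit = total = 0
    for r in range(n_requests):
        tenant = r % n_tenants
        # Each tenant shares an 8-token system prompt; each request adds 4 unique tokens.
        system_prompt = [tenant * 1000 + j for j in range(8)]
        user_turn = [100_000 + r * 4 + j for j in range(4)]
        blocks = prefix_blocks(system_prompt + user_turn)
        pod = route(blocks)
        hit += pod.serve(blocks)
        total += len(blocks)
    return hit, total
```

Calling `simulate(round_robin_router)` and `simulate(prefix_aware_router)` on this synthetic workload shows the baseline scattering each tenant's shared prompt across all pods, while the prefix-aware picker pins each tenant to the pod that already holds its prefix. A production router would also weigh queue depth and per-accelerator speed, which this sketch reduces to a simple tie-break.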

Artifacts are available at https://github.com/praveingk/llmd-benchmarking-nxtgen

Take-aways:

  • Heterogeneous GPU fleets stop being a tax once routing is cache- and
    load-aware.
    A single Kubernetes serving layer can serve across NVIDIA, AMD, and
    Intel accelerators concurrently and beat round-robin by close to 2× on
    throughput, with the same pods, same vLLM, same flags. Only the routing
    layer differs, so the win is attributable to llm-d’s prefix-cache-aware
    router.
  • Sovereign and on-prem inference is now operationally viable on
    mixed-vendor hardware.
    Procurement no longer has to align with a single
    vendor’s roadmap to get good aggregate throughput; older accelerators can
    absorb low-priority workloads while premium hardware handles
    latency-sensitive paths.

Audience:

  • Platform and SRE teams running on-prem / sovereign / hybrid-cloud LLM inference
  • ML infrastructure engineers evaluating Kubernetes-native serving stacks
    (vLLM, llm-d, KServe)
  • Teams considering or already running heterogeneous GPU fleets
    (NVIDIA + AMD + Intel) and worried about how to schedule across them
  • Sovereign-cloud and regulated-industry teams (BFSI, government, healthcare)
    who need on-prem inference and cannot rely on hyperscaler-only stacks
  • Open-source contributors interested in the llm-d / vLLM / Gateway-API
    Inference Extension projects

Bio:

Pravein Govindan Kannan is a Staff Research Scientist at IBM Research working on Systems and Networking for AI Inference. He contributes to open-source projects like llm-d, UCCL and NIXL.
