Pravein Govindan Kannan

@praveingk

Optimized AI Inference with llm-d and a case study on Sovereign AI on a heterogeneous GPU cluster

Submitted Jun 24, 2026

Introducing llm-d

llm-d (llm-d.ai) is a high-performance distributed inference serving stack optimized for production deployments on Kubernetes. It provides a transparent routing layer that sits between the client and vLLM serving pods and makes key scheduling decisions at request granularity:

  • Prefix-cache-aware routing: routes requests to pods whose KV-cache already holds shared prefix content (e.g., system prompts, few-shot examples), avoiding recomputation and reducing time-to-first-token (TTFT).
  • Load-aware dispatch: balances requests across pods based on real-time queue depth and concurrency, not round-robin or random assignment.
  • Optimized networking: integrates with NIXL, UCCL, and DeepEP for efficient KV-cache transfer and expert parallelism.
  • Vendor-agnostic scheduling: works identically across NVIDIA, AMD, and Intel accelerators, with no per-vendor configuration changes.
  • Drop-in deployment: deploys as a sidecar or gateway atop any existing vLLM deployment with zero application-level code changes.
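
To make the first two bullets concrete, here is a minimal sketch of how a router might combine prefix-cache affinity with load awareness. It is not llm-d's actual scorer: the `Pod` class, the `pick_pod` function, the `load_weight` parameter, and the scoring formula are all hypothetical, and prefixes are compared as plain strings rather than as tokenized KV-cache blocks.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Hypothetical router-side view of one vLLM serving pod."""
    name: str
    queue_depth: int = 0  # requests currently queued or running
    cached_prefixes: set[str] = field(default_factory=set)  # assumed cache contents


def pick_pod(prompt: str, pods: list[Pod], load_weight: float = 1.0) -> Pod:
    """Return the pod with the highest (cache hit fraction - load penalty)."""
    max_depth = max((p.queue_depth for p in pods), default=0) or 1

    def score(pod: Pod) -> float:
        # Longest cached prefix that the prompt starts with (0 if none).
        best_hit = max(
            (len(pfx) for pfx in pod.cached_prefixes if prompt.startswith(pfx)),
            default=0,
        )
        hit_fraction = best_hit / max(len(prompt), 1)
        load_fraction = pod.queue_depth / max_depth
        return hit_fraction - load_weight * load_fraction

    return max(pods, key=score)


system_prompt = "You are a helpful assistant."
pods = [
    Pod("nvidia-0", queue_depth=2, cached_prefixes={system_prompt}),
    Pod("amd-0", queue_depth=1),
    Pod("gaudi-0", queue_depth=8, cached_prefixes={system_prompt}),
]
chosen = pick_pod(system_prompt + " Summarise this report.", pods)
print(chosen.name)
```

In this toy setup the lightly loaded pod that already holds the system prompt is preferred over both the idle pod with a cold cache and the busy pod with a warm one. A request with no cached prefix anywhere falls back to the least-loaded pod.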

Case study: NxtGen sovereign cloud

Most production inference clusters today are single-vendor, not because that is optimal, but because it is the simplest way to set things up. Real fleets, however, are accumulating heterogeneity through procurement cycles, supply constraints, and the widening cost gap between accelerators. The open question is whether a single Kubernetes-native serving layer can take a heterogeneous GPU fleet and beat plain Kubernetes round-robin on throughput and time-to-first-token, with no application-level changes.

We benchmarked llm-d on the NxtGen sovereign cloud’s 3-vendor cluster: 4× NVIDIA H100-NVL, 8× AMD MI325X, and 8× Intel Gaudi3 over a shared 100G RoCE fabric, serving ibm-granite/granite-4.1-8b and sarvamai/sarvam-30b. Across single-vendor pools (NVIDIA-only, AMD-only, Gaudi-only) and heterogeneous pools (NVIDIA+AMD, NVIDIA+AMD+Gaudi), llm-d’s prefix-cache-aware routing delivers 25% to 91% higher throughput and 5–22× lower TTFT than plain Kubernetes round-robin, and the advantage grows with pool size and heterogeneity. The biggest win is on the 20-pod 3-vendor pool, where llm-d reaches +91% throughput at the same load.
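
One plausible reason the gap widens with pool size can be shown with a toy simulation. This is an illustrative model only, not the benchmark harness and not a reproduction of the reported numbers. It assumes each pod keeps a small LRU cache of prefix IDs and that prompts draw uniformly from a fixed set of prefixes. The `hit_rate` function, its parameters, and the `prefix % num_pods` affinity rule are all hypothetical. The model ignores load and hardware differences entirely.

```python
import random
from collections import OrderedDict


def hit_rate(num_pods, policy, num_prefixes=64, slots=4,
             num_requests=4000, seed=0):
    """Fraction of requests routed to a pod that already caches their prefix."""
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    caches = [OrderedDict() for _ in range(num_pods)]  # one LRU per pod
    hits = 0
    for i in range(num_requests):
        prefix = rng.randrange(num_prefixes)
        if policy == "round_robin":
            pod = i % num_pods        # ignores cache contents
        else:
            pod = prefix % num_pods   # same prefix always goes to the same pod
        cache = caches[pod]
        if prefix in cache:
            hits += 1
            cache.move_to_end(prefix)      # mark as most recently used
        else:
            cache[prefix] = True
            if len(cache) > slots:
                cache.popitem(last=False)  # evict least recently used
    return hits / num_requests


for pods in (4, 8, 16):
    rr = hit_rate(pods, "round_robin")
    aff = hit_rate(pods, "prefix_affinity")
    print(f"{pods:>2} pods  round-robin={rr:.2f}  prefix-affinity={aff:.2f}")
```

Under these assumptions, round-robin makes every pod compete for the same working set, so adding pods adds no effective cache capacity. Prefix affinity shards the working set, so each added pod increases the share of prefixes that stay resident.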

Resources: https://llm-d.ai/blog/heterogeneous-inference-3-vendor-sovereign-cluster

Take-aways

  • Heterogeneous GPU fleets stop being a tax once routing is cache- and load-aware. A single Kubernetes serving layer can absorb NVIDIA, AMD, and Intel accelerators concurrently and beat round-robin by close to 2× on throughput.
  • The comparison uses the same pods, the same vLLM, and the same flags; only the routing layer differs. The gain is therefore attributable to llm-d’s prefix-cache-aware router.
  • Sovereign and on-prem inference is now operationally viable on mixed-vendor hardware. Procurement no longer has to align with a single vendor’s roadmap to get good aggregate throughput; older accelerators can absorb low-priority workloads while premium hardware handles latency-sensitive paths.

Audience

  • Platform and SRE teams running on-prem, sovereign, or hybrid-cloud LLM inference.
  • ML infrastructure engineers evaluating Kubernetes-native serving stacks (vLLM, llm-d, KServe).
  • Teams considering or already running heterogeneous GPU fleets (NVIDIA + AMD + Intel) and worried about how to schedule across them.
  • Sovereign-cloud and regulated-industry teams (BFSI, government, healthcare) who need on-prem inference and cannot rely on hyperscaler-only stacks.
  • Open-source contributors interested in the llm-d, vLLM, and Gateway API Inference Extension projects.

Bio

Pravein Govindan Kannan is a Staff Research Scientist at IBM Research working on Systems and Networking for AI Inference. He contributes to open-source projects such as llm-d, UCCL, and NIXL.
Jayanth Babu Reddy is a Principal Architect at NxtGen Cloud Technologies.
