The Fifth Elephant 2025 Annual Conference CfP

Speak at The Fifth Elephant 2025 Annual Conference

Johnu George

@johnugeorge Submitter

Tuning the Inference stack for AI

Submitted Jun 21, 2025

Overview

Deploying large language models (LLMs) at scale puts heavy demands on AI infrastructure. The stack must be cloud-agnostic and hardware-independent, able to run everywhere, from edge devices to public clouds, and support a range of accelerators: GPUs from NVIDIA, AMD, and Intel, as well as TPUs and others. These high-end accelerators are expensive and often under-utilized. Model sizes range from tens of millions to trillions of parameters, creating serious capacity-planning challenges. Larger models need more accelerator memory and usually rely on parallelism schemes (tensor, pipeline, and data parallelism) to fit the topology of the accelerator cluster. Sharding a model across many GPUs adds complexity, and a poor implementation can severely reduce throughput. Data transfer between GPUs, CPUs, and storage can also become a major latency bottleneck.
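
To make the sharding discussion concrete, here is a minimal sketch, assuming vLLM as the serving engine, of how a large model might be split across accelerators with tensor and pipeline parallelism. The model name and parallelism degrees are illustrative assumptions, not recommendations.

    # Minimal sketch: sharding a large model across GPUs with vLLM.
    # The model name and parallelism degrees below are illustrative assumptions.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",  # assumed example model; pick one that fits your cluster
        tensor_parallel_size=4,                     # shard each layer's weights across 4 GPUs
        pipeline_parallel_size=2,                   # split layers across 2 groups (recent vLLM versions)
    )

    params = SamplingParams(max_tokens=128, temperature=0.7)
    outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
    print(outputs[0].outputs[0].text)

The right split depends on the interconnect: tensor parallelism is usually kept within a node where GPU-to-GPU bandwidth is high, while pipeline parallelism tolerates slower links between nodes.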

For peak performance, the inference stack needs careful tuning: quantization, KV caching, batching, disaggregation, speculative decoding, and other techniques all matter. At the same time, production systems must survive out-of-memory errors, hardware failures, and network glitches.
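
As a non-authoritative illustration, the sketch below shows how a few of these knobs (weight quantization, KV-cache reuse via prefix caching, batching limits, and memory headroom) might be set when using vLLM as the serving engine. The checkpoint name and values are assumptions, and disaggregation and speculative decoding are omitted for brevity.

    # Minimal sketch of a few inference-tuning knobs, using vLLM as one example engine.
    # The checkpoint name and numeric values are illustrative assumptions.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="TheBloke/Llama-2-13B-chat-AWQ",  # assumed AWQ-quantized checkpoint
        quantization="awq",                     # weight quantization to cut accelerator memory
        gpu_memory_utilization=0.90,            # leave headroom to avoid out-of-memory errors
        max_num_seqs=64,                        # cap on concurrently batched sequences (continuous batching)
        enable_prefix_caching=True,             # reuse KV-cache entries for shared prompt prefixes
    )

    outputs = llm.generate(
        ["Summarize KV caching in two sentences."] * 8,  # requests are batched automatically
        SamplingParams(max_tokens=64),
    )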

Nutanix Enterprise AI addresses these challenges with a turnkey, private, centralized inference platform. Built on Kubernetes, it runs anywhere, including air-gapped sites. It provides a scalable control plane for simple, one-click deployment and management of any model, in a predictable, secure, and cost-transparent environment. Because inference is managed privately, all models and data stay under your control. Day-2 operations tooling shows exactly where each component runs, from infrastructure to model, so teams can easily update, scale, and troubleshoot.

Discussion points

  • AI infrastructure requirements
  • Which accelerators, and how many, best fit my workload?
  • Are my accelerators fully utilized? (A minimal monitoring sketch follows this list.)
  • When should I scale up the cluster? Is my stack future-proof enough to handle that scale?
  • Is the AI infrastructure stack cost-efficient?
  • How should I monitor infrastructure performance?
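
To ground the utilization question above, here is a minimal sketch that samples per-GPU utilization and memory through NVML via the pynvml bindings. The under-utilization threshold is an assumed value for illustration.

    # Minimal sketch: sample per-GPU utilization and memory with NVML (pynvml bindings).
    # The 30% threshold is an assumed cutoff for flagging under-utilized accelerators.
    import pynvml

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % busy over the last sample window
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"GPU {i}: sm_util={util.gpu}% mem={mem.used / mem.total:.0%}")
            if util.gpu < 30:
                print(f"  -> GPU {i} looks under-utilized; consider consolidating workloads")
    finally:
        pynvml.nvmlShutdown()

In production, teams typically export these metrics continuously (for example through a DCGM-based exporter into Prometheus) rather than polling ad hoc.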

Audience

  • ML engineers selecting models and applying architectural optimizations
  • SRE teams scaling and managing clusters and accelerators
  • Infrastructure architects designing secure, cost-predictable on-prem or hybrid AI stacks
  • Application developers building AI apps or agentic workflows that call LLM endpoints
