The Fifth Elephant 2025 Annual Conference CfP

Speak at The Fifth Elephant 2025 Annual Conference

Johnu George

@johnugeorge Submitter

Tuning the Inference stack for AI

Submitted Jun 21, 2025

Overview

Deploying large language models (LLMs) at scale puts heavy demands on AI infrastructure. The stack must be cloud-agnostic and hardware-independent, able to run everywhere, from edge devices to public clouds, and support a range of accelerators: GPUs from NVIDIA, AMD, and Intel, as well as TPUs and others. These high-end accelerators are expensive and often under-utilized. Model sizes range from tens of millions to trillions of parameters, creating serious capacity-planning challenges. Larger models need more accelerator memory and usually rely on parallelism schemes (tensor, pipeline, and data parallelism) to fit the topology of the accelerator cluster. Sharding a model across many GPUs adds complexity, and a poor implementation can severely reduce throughput. Data transfer between GPUs, CPUs, and storage can also become a major latency bottleneck.
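
To make the sharding discussion concrete, here is a minimal sketch, assuming vLLM as the serving engine, of how a large model might be split across accelerators with tensor and pipeline parallelism. The model name and parallelism degrees are illustrative assumptions, not recommendations.

    # Minimal sketch: sharding a large model across GPUs with vLLM.
    # The model name and parallelism degrees below are illustrative assumptions.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",  # assumed example model; pick one that fits your cluster
        tensor_parallel_size=4,                     # shard each layer's weights across 4 GPUs
        pipeline_parallel_size=2,                   # split layers across 2 groups (recent vLLM versions)
    )

    params = SamplingParams(max_tokens=128, temperature=0.7)
    outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
    print(outputs[0].outputs[0].text)

The right split depends on the interconnect: tensor parallelism is usually kept within a node where GPU-to-GPU bandwidth is high, while pipeline parallelism tolerates slower links between nodes.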

For peak performance, the inference stack needs careful tuning: quantization, KV caching, batching, disaggregation, speculative decoding, and other techniques all matter. At the same time, production systems must survive out-of-memory errors, hardware failures, and network glitches.
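
As a non-authoritative illustration, the sketch below shows how a few of these knobs (weight quantization, KV-cache reuse via prefix caching, batching limits, and memory headroom) might be set when using vLLM as the serving engine. The checkpoint name and values are assumptions, and disaggregation and speculative decoding are omitted for brevity.

    # Minimal sketch of a few inference-tuning knobs, using vLLM as one example engine.
    # The checkpoint name and numeric values are illustrative assumptions.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="TheBloke/Llama-2-13B-chat-AWQ",  # assumed AWQ-quantized checkpoint
        quantization="awq",                     # weight quantization to cut accelerator memory
        gpu_memory_utilization=0.90,            # leave headroom to avoid out-of-memory errors
        max_num_seqs=64,                        # cap on concurrently batched sequences (continuous batching)
        enable_prefix_caching=True,             # reuse KV-cache entries for shared prompt prefixes
    )

    outputs = llm.generate(
        ["Summarize KV caching in two sentences."] * 8,  # requests are batched automatically
        SamplingParams(max_tokens=64),
    )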

Nutanix Enterprise AI addresses these challenges with a turnkey, private, centralized inference platform. Built on Kubernetes, it runs anywhere, including air-gapped sites. It provides a scalable control plane for simple, one-click deployment and management of any model, in a predictable, secure, and cost-transparent environment. Because inference is managed privately, all models and data stay under your control. Day-2 operations tooling shows exactly where each component runs, from infrastructure to model, so teams can easily update, scale, and troubleshoot.

Discussion points

  • AI infrastructure requirements
  • Which accelerators, and how many, best fit my workload?
  • Are my accelerators fully utilized? (A minimal monitoring sketch follows this list.)
  • When should I scale up the cluster? Is my stack future-proof enough to handle that scale?
  • Is the AI infrastructure stack cost-efficient?
  • How should I monitor infrastructure performance?
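
To ground the utilization question above, here is a minimal sketch that samples per-GPU utilization and memory through NVML via the pynvml bindings. The under-utilization threshold is an assumed value for illustration.

    # Minimal sketch: sample per-GPU utilization and memory with NVML (pynvml bindings).
    # The 30% threshold is an assumed cutoff for flagging under-utilized accelerators.
    import pynvml

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % busy over the last sample window
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"GPU {i}: sm_util={util.gpu}% mem={mem.used / mem.total:.0%}")
            if util.gpu < 30:
                print(f"  -> GPU {i} looks under-utilized; consider consolidating workloads")
    finally:
        pynvml.nvmlShutdown()

In production, teams typically export these metrics continuously (for example through a DCGM-based exporter into Prometheus) rather than polling ad hoc.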

Audience

  • ML engineers selecting models and applying architectural optimizations
  • SRE teams scaling and managing clusters and accelerators
  • Infrastructure architects designing secure, cost-predictable on-prem or hybrid AI stacks
  • Application developers building AI apps or agentic workflows that call LLM endpoints
