Kundan Kumar

@shadow_walker9170 BOF facilitator

LLM Inference Optimizations: A Deep Dive into Modern Techniques

Submitted Jan 26, 2026

Problem statement

The core problem discussed is the “Memory Wall” in LLM inference: GPU compute throughput has scaled dramatically (roughly 50,000x over the last decade) while memory bandwidth has grown only about 100x, making inference memory-bound rather than compute-bound. The result is idle GPU cores, high latency, and inefficient resource utilization, especially for long-context models and batch processing.
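
As a rough illustration of why decoding is bandwidth-limited, the sketch below compares compute time and weight-streaming time per generated token. The GPU figures (~312 TFLOPS BF16, ~2 TB/s HBM) and the 7B FP16 model are assumptions chosen only to make the arithmetic concrete, not measurements from the session.

```python
# Back-of-the-envelope roofline check (assumed, illustrative numbers:
# ~312 TFLOPS BF16 compute and ~2 TB/s HBM bandwidth, ballpark for a modern
# data-center GPU; 7B-parameter model stored in FP16).
peak_flops = 312e12          # FLOP/s (assumed accelerator peak)
peak_bw    = 2.0e12          # bytes/s (assumed HBM bandwidth)
params     = 7e9             # model parameters (assumed model size)
bytes_per_param = 2          # FP16 weights

# The GPU needs ~156 FLOPs of work per byte moved to stay compute-bound,
# but batch-1 decoding performs only ~2 FLOPs per weight while reading
# every FP16 weight (2 bytes) once per token: ~1 FLOP per byte.
balance = peak_flops / peak_bw
intensity_batch1 = 2 * params / (params * bytes_per_param)

time_compute = 2 * params / peak_flops              # if compute were the limit
time_memory  = params * bytes_per_param / peak_bw   # time to stream weights once

print(f"balance point: {balance:.0f} FLOPs/byte, batch-1 intensity: {intensity_batch1:.0f}")
print(f"per token: compute {time_compute*1e3:.2f} ms vs memory {time_memory*1e3:.2f} ms")
```

Under these assumptions, streaming the weights takes two orders of magnitude longer than the math, which is exactly the gap that the techniques below attack.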

Under this topic, we intend to cover a few popular techniques for improving memory-usage efficiency and unlocking the potential of LLM inference:

  • Flash Attention
  • Virtual-memory-inspired techniques, Paged Attention and Prefix Caching, to eliminate fragmentation
  • Use of KV caches and KV cache compression (a minimal KV cache sketch follows this list)
  • Continuous Batching and Speculative Decoding to alleviate bandwidth bottlenecks and improve the compute-to-memory-movement ratio
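
To ground the KV cache item above, here is a minimal, NumPy-only sketch of incremental decoding with a cache. The weight matrices and `decode_step` function are a hypothetical toy single-head model, not any library API; the point is only that each step adds one new key/value row instead of recomputing attention inputs for the whole prefix.

```python
# Minimal KV cache sketch (toy single-head attention, NumPy only).
import numpy as np

d = 64                       # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []    # grows by one row per generated token

def decode_step(x):
    """x: hidden state of the newest token, shape (d,)."""
    q = x @ Wq
    k_cache.append(x @ Wk)   # O(1) new work; past K/V are reused from the cache
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)    # (t, d)
    V = np.stack(v_cache)    # (t, d)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V       # attention output for the newest token

for _ in range(8):           # toy decode loop
    out = decode_step(rng.standard_normal(d))
```

The cache trades memory for compute: per-token work stays constant, but the cache itself grows linearly with context length, which is what Paged Attention, prefix caching, and compression then have to manage.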

Key takeaways

  1. Attendees will gain a clear understanding of why LLM inference is memory-bound and how the KV cache, together with techniques like Flash Attention and Paged Attention, can achieve 2-4x speedups and higher GPU utilization, enabling longer contexts and larger batches without hardware upgrades.
  2. Participants will learn actionable strategies for KV cache management and speculative decoding, leading to faster token generation (~2-3x) while maintaining equivalence to standard decoding, directly applicable to real-world serving systems like vLLM (see the sketch after this list).
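
The sketch below illustrates the speculative-decoding idea in a simplified greedy-acceptance form: a cheap draft model proposes several tokens, the target model verifies them, and the longest matching prefix is kept plus one corrected token when they diverge. The `draft_next` and `target_logits` functions are hypothetical stubs standing in for real models; production systems use a rejection-sampling scheme that preserves the target distribution exactly.

```python
# Simplified speculative decoding sketch (greedy acceptance, toy stub models).
import numpy as np

VOCAB = 16
rng = np.random.default_rng(0)

def draft_next(ctx):
    """Toy 'small' draft model: deterministic next-token guess."""
    return (sum(ctx) * 7 + len(ctx)) % VOCAB

def target_logits(ctx):
    """Toy 'large' target model: logits for the next token given ctx."""
    rs = np.random.default_rng(sum(ctx) + 13 * len(ctx))
    return rs.standard_normal(VOCAB)

def speculative_step(ctx, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposal, c = [], list(ctx)
    for _ in range(k):
        t = draft_next(c)
        proposal.append(t)
        c.append(t)
    # 2. Target model verifies the proposals (in a real system this is a
    #    single batched forward pass over all k positions).
    accepted, c = [], list(ctx)
    for t in proposal:
        best = int(np.argmax(target_logits(c)))
        if best == t:
            accepted.append(t)     # draft token matches the target's greedy choice
            c.append(t)
        else:
            accepted.append(best)  # substitute the target's token and stop
            break
    return accepted                # at least one token per target "pass"

ctx = [1, 2, 3]
for _ in range(5):
    ctx += speculative_step(ctx)
```

Each verification pass yields at least one token and often several, which is where the ~2-3x generation speedup comes from without changing what the target model would have produced.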

Audiences for this session

This discussion will benefit:

  • Machine learning engineers and AI developers involved in deploying and scaling LLMs in production environments, who need practical techniques to reduce latency and costs.
  • Researchers and data scientists focused on transformer architectures, seeking insights into memory bottlenecks and optimization trade-offs.
  • Product managers and tech leads in AI-driven companies (e.g., chatbots, recommendation systems), who can apply these efficiencies to improve throughput and user experience.

About the facilitator

Kundan Kumar is a final year Computer Science student at IIT Kanpur. He has worked on KV caching systems at Nutanix as a visiting researcher. His interests lie at the intersection of systems optimization and AI infrastructure.
