The Fifth Elephant Pune edition

LLM Inference Optimizations: A Deep Dive into Modern Techniques

Problem statement

The core problem discussed is the "Memory Wall" in LLM inference: GPU computational power has scaled dramatically (roughly 50,000x over the last decade) while memory bandwidth has grown only about 100x, making inference memory-bound rather than compute-bound. This leads to idle GPU cores, high latency, and inefficient resource utilization, especially for long-context models and …
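To make the memory-bound claim concrete, here is a minimal back-of-the-envelope sketch (not part of the proposal) comparing the arithmetic intensity of a decode-step matrix-vector product against a GPU's compute-to-bandwidth ratio. The hardware figures are assumptions, roughly matching a modern datacenter GPU; substitute your own device's specs.

```python
# Illustrative hardware numbers (assumed, not from the proposal):
PEAK_FLOPS = 312e12   # assumed peak BF16 throughput, FLOP/s
PEAK_BW = 2.0e12      # assumed HBM bandwidth, bytes/s
MACHINE_BALANCE = PEAK_FLOPS / PEAK_BW  # FLOPs the GPU can do per byte moved


def decode_arithmetic_intensity(d_model: int, bytes_per_param: int = 2) -> float:
    """Arithmetic intensity (FLOPs/byte) of one d_model x d_model
    matrix-vector product, the core operation when generating one
    token at batch size 1.

    FLOPs: 2 * d_model^2 (one multiply-add per weight)
    Bytes: d_model^2 * bytes_per_param (every weight is read once from HBM)
    """
    flops = 2 * d_model * d_model
    bytes_moved = d_model * d_model * bytes_per_param
    return flops / bytes_moved


ai = decode_arithmetic_intensity(d_model=8192)  # ~1 FLOP/byte at BF16
print(f"decode intensity: {ai:.1f} FLOP/byte "
      f"vs machine balance: {MACHINE_BALANCE:.0f} FLOP/byte")
# ~1 FLOP/byte vs ~156 FLOP/byte: the compute cores finish their work long
# before the next weights arrive from memory, so they sit idle waiting on
# bandwidth. This is exactly the "Memory Wall" described above.
```

Under these assumptions the GPU could perform roughly 150x more arithmetic per byte than decode actually demands, which is why techniques that reduce memory traffic (quantization, KV-cache management, batching) dominate inference optimization.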
Indicate the track in which your submission fits: Track 1: AI in Software Development Life Cycle (SDLC)
Type of submission: Birds of a Feather (BoF) session