The Fifth Elephant Pune edition

LLM Inference Optimizations: A Deep Dive into Modern Techniques

Problem statement

The core problem discussed is the "Memory Wall" in LLM inference: GPU computational power has scaled dramatically (roughly 50,000x over the last decade) while memory bandwidth has grown only about 100x, making inference memory-bound rather than compute-bound. This leads to idle GPU cores, high latency, and inefficient resource utilization, especially for long-context models and …
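To make the memory-bound claim concrete, here is a minimal back-of-the-envelope sketch (not part of the proposal) comparing the arithmetic intensity of a decode-step matrix-vector product against a GPU's compute-to-bandwidth ratio. The hardware figures are assumptions, roughly matching a modern datacenter GPU; substitute your own device's specs.

```python
# Illustrative hardware numbers (assumed, not from the proposal):
PEAK_FLOPS = 312e12   # assumed peak BF16 throughput, FLOP/s
PEAK_BW = 2.0e12      # assumed HBM bandwidth, bytes/s
MACHINE_BALANCE = PEAK_FLOPS / PEAK_BW  # FLOPs the GPU can do per byte moved


def decode_arithmetic_intensity(d_model: int, bytes_per_param: int = 2) -> float:
    """Arithmetic intensity (FLOPs/byte) of one d_model x d_model
    matrix-vector product, the core operation when generating one
    token at batch size 1.

    FLOPs: 2 * d_model^2 (one multiply-add per weight)
    Bytes: d_model^2 * bytes_per_param (every weight is read once from HBM)
    """
    flops = 2 * d_model * d_model
    bytes_moved = d_model * d_model * bytes_per_param
    return flops / bytes_moved


ai = decode_arithmetic_intensity(d_model=8192)  # ~1 FLOP/byte at BF16
print(f"decode intensity: {ai:.1f} FLOP/byte "
      f"vs machine balance: {MACHINE_BALANCE:.0f} FLOP/byte")
# ~1 FLOP/byte vs ~156 FLOP/byte: the compute cores finish their work long
# before the next weights arrive from memory, so they sit idle waiting on
# bandwidth. This is exactly the "Memory Wall" described above.
```

Under these assumptions the GPU could perform roughly 150x more arithmetic per byte than decode actually demands, which is why techniques that reduce memory traffic (quantization, KV-cache management, batching) dominate inference optimization.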
Indicate the track in which your submission fits: Track 1: AI in Software Development Life Cycle (SDLC)
Type of submission: Birds of a Feather (BoF) session