Fri, 19 Jun 2026, 02:00 PM – 06:00 PM IST
Vivek Kalyanarangan
Submitted May 26, 2026
Face matching is one of the highest-volume workloads in identity verification. At IDfy, a single GPU pod handling 1 RPS cost us ₹3,500/day. After moving the model to BF16 inference on Intel CPUs via OpenVINO, the same 1 RPS pod cost ₹350/day. Same TAT, same throughput, same accuracy envelope. At our traffic shape (50 RPS sustained for the peak hour, 10 RPS for the remaining 23), that translates to roughly ₹11 lakh a month in savings on this single workload, before you account for the GPU capacity it freed up for workloads that genuinely need it.
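The monthly figure can be reproduced with a few lines of arithmetic. The sketch below is one plausible reading, not necessarily the exact model behind the number: it assumes one pod per 1 RPS, pods that scale with traffic and are billed pro rata by the hour, and a 30-day month. None of these assumptions are stated in the abstract, and the constant names are illustrative.

```python
# Back-of-the-envelope check of the monthly savings figure.
# Assumptions (not stated in the abstract): one pod serves 1 RPS, pods scale
# with traffic and are billed pro rata by the hour, and a month has 30 days.

GPU_COST_PER_POD_DAY = 3500  # INR, from the abstract
CPU_COST_PER_POD_DAY = 350   # INR, from the abstract

PEAK_RPS, PEAK_HOURS = 50, 1   # 50 RPS sustained for the peak hour
BASE_RPS, BASE_HOURS = 10, 23  # 10 RPS for the remaining 23 hours
DAYS_PER_MONTH = 30            # assumed


def monthly_cost(cost_per_pod_day: int) -> float:
    """Monthly cost in INR when pod count tracks RPS hour by hour."""
    pod_hours_per_day = PEAK_RPS * PEAK_HOURS + BASE_RPS * BASE_HOURS
    # A pod-day is 24 pod-hours, so divide once at the end.
    return pod_hours_per_day * cost_per_pod_day * DAYS_PER_MONTH / 24


gpu_monthly = monthly_cost(GPU_COST_PER_POD_DAY)
cpu_monthly = monthly_cost(CPU_COST_PER_POD_DAY)
monthly_savings = gpu_monthly - cpu_monthly

print(f"GPU: INR {gpu_monthly:,.0f} per month")
print(f"CPU: INR {cpu_monthly:,.0f} per month")
print(f"Savings: about INR {monthly_savings / 100_000:.1f} lakh per month")
```

Under these assumptions the traffic shape works out to 280 pod-hours a day, and the difference comes to ₹11,02,500 a month, in line with the roughly ₹11 lakh quoted above. A fleet provisioned statically for the 50 RPS peak would give a much larger figure, so the pro rata assumption matters.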
This talk is not a “CPU beats GPU” pitch. It is the operational story of how we got there: the calibration set we built, the operators that refused to quantize cleanly, the one architectural tweak we made so OpenVINO could fuse properly, and the production canary we ran to convince ourselves the accuracy was stable. I’ll share two more migrations from IDfy’s 40+ model fleet, including one where the move failed in production and what telemetry caught it before users did.
Takeaways:
Audience:
Production ML and AI engineers, platform and infra teams, and engineering leaders who own inference cost-to-serve at scale.
Bio:
Vivek Kalyanarangan is Sr. Technical Architect, AI at IDfy, where a 20-person team operates 40+ production ML models across biometric authentication, document recognition and OCR, fraud detection, and large-scale NLP. He has 13+ years of experience across analytics, big data, and deep learning.
He is the author of Quantization and Fast Inference (Manning, MEAP 2026) and of the freeCodeCamp course LLMs from Scratch. He also contributes to open-source ML and has published papers.