Sushrut Ikhar

Future Proofing ML Inferencing: High Performance Java meets Scalable Remote Serving

Submitted Nov 16, 2025

In this talk, we present a hybrid ML inference architecture designed for today's performance demands and tomorrow's infrastructure challenges. We combine SIMD-accelerated inference in pure Java—using the Vector API and Fused Multiply-Add (FMA)—with remote inference via TensorFlow Serving, ONNX, and Triton. This allows us to strike the right balance: ultra-low latency and high throughput on critical paths, and flexible, scalable inference for complex models with more relaxed SLAs.

Our Java implementation avoids JNI entirely. Through assembly-level micro-benchmarking, we fine-tune matrix operations to achieve 250% improvements in latency and throughput, while staying fully within the JVM—ensuring portability, debuggability, and operational simplicity.

We don't claim Java beats C++ or Rust. Instead, we show it's now a viable and future-ready option for performance-critical inference—especially in JVM-based stacks where speed, iteration velocity, and cost efficiency all matter.

Why this matters: As models grow and workloads scale, optimizing inference infrastructure is no longer optional. Businesses need architectures that can evolve with hardware, control cloud costs, and meet diverse performance needs. Attendees will take away a blueprint for modern inference—from low-level SIMD tuning in Java to when and how to leverage remote inference—all focused on building scalable, cost-effective, and future-proof ML platforms.
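To make the SIMD angle concrete, here is a minimal sketch (not the speaker's actual implementation) of a dot product—the core of matrix operations in inference—using the incubating Vector API with FMA. The class and method names are illustrative; it requires JDK 16+ with `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class SimdDot {
    // Pick the widest vector shape the hardware supports (e.g. 256-bit AVX2 = 8 floats).
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        // Main loop: one fused multiply-add per lane per iteration.
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // acc = a * b + acc, typically one FMA instruction
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        // Scalar tail for lengths not divisible by the lane count.
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f, 5f};
        float[] b = {2f, 2f, 2f, 2f, 2f};
        System.out.println(SimdDot.dot(a, b)); // 30.0
    }
}
```

On hardware with FMA support, the JIT compiles `fma` to a single fused instruction, which is where the latency and throughput gains over scalar Java come from.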

This talk targets businesses that need low-latency model serving, machine learning engineers, and tech-savvy enthusiasts.

Bio: Sushrut Ikhar is a seasoned software architect at InMobi with over a decade of experience building and scaling data and machine learning platforms. At InMobi, India’s first unicorn, he leads the Machine Learning Platform team, guiding a group of architects and engineers in designing resilient, high-performance systems that power large-scale AI applications. His core expertise spans the modern data and ML ecosystem, including Spark, Hadoop, Ray, TensorFlow, PyTorch, Airflow, Java, and Databricks, along with experience architecting high-scale, low-latency serving systems.

