At LinkedIn, we serve hundreds of thousands of inferences per second across hundreds of ML models concurrently in our online systems. These models have widely varying system performance characteristics, ranging from lightweight XGBoost models to memory-intensive recommendation models to the newer generative AI models, which are both compute- and memory-intensive. We run them across different hardware profiles, spanning multiple CPU and GPU SKUs. To manage this diversity, we built a performance benchmarking system for ML models at LinkedIn based on the MLPerf Inference benchmark paper. This system plays a crucial role in ensuring optimal performance and resource utilization, and it streamlines the ML model serving process so that ML engineers can launch models seamlessly, without needing to delve into complex hardware configurations.
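To make the MLPerf-style approach concrete, here is a minimal sketch of how such a benchmark harness can measure a model's serving characteristics: it paces requests at a target arrival rate and records per-request latency percentiles and achieved throughput. The `infer` callable, the request list, and all parameter names are hypothetical placeholders, not part of LinkedIn's actual system.

```python
import statistics
import time

def benchmark(infer, requests, target_qps):
    """Send requests at roughly `target_qps` and record per-request latency.

    `infer` is a hypothetical single-request inference callable; in a real
    harness it would be an RPC to the model server under test.
    Returns p50/p99 latency in milliseconds and the achieved throughput.
    """
    interval = 1.0 / target_qps
    latencies = []
    t0 = time.perf_counter()
    for req in requests:
        start = time.perf_counter()
        infer(req)
        end = time.perf_counter()
        latencies.append((end - start) * 1000.0)
        # Pace requests to approximate the target arrival rate.
        time.sleep(max(0.0, interval - (end - start)))
    elapsed = time.perf_counter() - t0
    latencies.sort()
    p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
    return {
        "p50_ms": statistics.median(latencies),
        "p99_ms": p99,
        "achieved_qps": len(requests) / elapsed,
    }
```

A real MLPerf Inference deployment uses a load generator with several traffic scenarios (single-stream, server, offline); this sketch corresponds roughly to the server scenario with a constant arrival rate.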
We further explore the practical applications of the performance benchmarking system:
- Enable ML engineers and data scientists to iterate on and experiment with models faster, without worrying about hardware, performance characteristics, or capacity estimation.
- Reduce costs through increased resource utilization by tuning system configurations.
- Build guardrails to identify and prevent regressions during rollouts of new models and system software.
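The regression guardrail above can be sketched as a simple comparison of a candidate's benchmark results against a baseline run, failing the rollout when latency or throughput degrades beyond a tolerance. The metric names and thresholds here are illustrative assumptions, not LinkedIn's actual guardrail policy.

```python
def check_regression(baseline, candidate,
                     max_latency_regression=0.10, max_qps_drop=0.05):
    """Compare a candidate benchmark run against a baseline run.

    Both arguments are dicts with hypothetical keys "p99_ms" and "qps".
    Returns a list of human-readable failure reasons; an empty list
    means the candidate is within tolerance and may roll out.
    """
    failures = []
    if candidate["p99_ms"] > baseline["p99_ms"] * (1 + max_latency_regression):
        failures.append(
            f"p99 latency regressed: {baseline['p99_ms']:.1f}ms "
            f"-> {candidate['p99_ms']:.1f}ms"
        )
    if candidate["qps"] < baseline["qps"] * (1 - max_qps_drop):
        failures.append(
            f"throughput dropped: {baseline['qps']:.0f} "
            f"-> {candidate['qps']:.0f} qps"
        )
    return failures
```

In practice, such a check would run automatically in the deployment pipeline, blocking the rollout and surfacing the failure reasons when the candidate regresses.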
- Landscape of online ML inference at LinkedIn
- MLPerf Inference benchmarks
- Architecture
- Applications
- Challenges faced and solutions
- Future work and conclusion
- Karan Goyal
- Hareesh Kumar Gajulapalli
- Ameya Karve