The success of large-scale recommender systems hinges on their ability to deliver accurate and timely recommendations to a diverse user base. At Glance, we deliver snackable, personalized content to the lock screens of 200M smartphones. At this scale, continuous monitoring is paramount: it safeguards data integrity, detects drift, tracks evolving user preferences, minimizes system downtime, and ultimately improves the system’s effectiveness and user satisfaction. This talk explores the critical role of continuous monitoring in our ecosystem. We introduce Vigil, a comprehensive end-to-end monitoring framework and set of practices designed specifically for Glance’s recommender systems. Vigil’s practices revolve around three key pillars: mitigating developer fatigue, ensuring precise predictions, and establishing a centralized monitoring framework. Since adopting these practices, we have observed an 18% increase in user engagement, a 30% reduction in compute cost, a 26% drop in downtime, and a surge in developer productivity, demonstrated by a 45% decrease in turnaround time.
- A centralized system monitoring view has improved developer productivity: it mitigates alert fatigue and has cut the turnaround time to detect and fix issues during on-call operations by 45%.
- By monitoring latencies, errors, and resource utilization, and by adopting practices such as adaptive retraining and eliminating redundant data and pipelines, we have reduced system costs by 30%.
- System downtime has decreased by 26%, improving reliability and keeping service uninterrupted for our users.
Implementing Vigil has led to tangible improvements in key performance metrics, showcasing the value of effective end-to-end monitoring in large-scale recommender systems. Glance’s experience with Vigil highlights the importance of continuous monitoring. The talk offers valuable insights that can be applied to similar large-scale recommender systems, benefiting system performance, user engagement, cost-efficiency, and developer productivity.
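The abstract mentions drift detection and adaptive retraining but does not describe how Vigil implements either. As a purely illustrative sketch, and not Vigil's actual method, one common approach is to compare a live feature sample against a reference sample using the Population Stability Index (PSI) and to retrain only when the index crosses a threshold. The function names `psi` and `should_retrain`, the bin count, and the 0.2 threshold (a widely used rule of thumb) are all assumptions made for this example.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a live sample.

    Hypothetical helper for illustration; bin edges come from the
    reference sample, and live values outside that range are clamped
    into the first or last bin.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(expected), max(expected)
    # Fall back to a width of 1.0 when the reference sample is constant.
    width = (hi - lo) / bins or 1.0

    def histogram(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        total = len(values)
        # Floor each proportion at eps so the logarithm stays defined.
        return [max(c / total, eps) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(reference: Sequence[float], live: Sequence[float],
                   threshold: float = 0.2) -> bool:
    """Trigger retraining only when drift exceeds the (assumed) threshold."""
    return psi(reference, live) > threshold


if __name__ == "__main__":
    reference = [i / 100 for i in range(100)]
    shifted = [0.9 + i / 1000 for i in range(100)]
    print(should_retrain(reference, reference))
    print(should_retrain(reference, shifted))
```

Gating retraining on a drift signal like this, rather than retraining on a fixed schedule, is one way such a practice could reduce compute cost; whether Glance uses PSI or another statistic is not stated in the abstract.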
- Introduction to Glance
- Challenges in monitoring large-scale recommender systems
- Vigil: A comprehensive end-to-end monitoring framework/practices
- Proactive alerting and system monitoring
- Dependency and impact monitoring
- Testing and performance monitoring
- Impact of adopting Vigil at Glance
- Expressway: A centralized monitoring tool built on the ideas of Vigil
- Key Takeaways
Slides Hyperlink - Vigil: Effective end-to-end monitoring for large-scale recommender systems at Glance
MLOps, Recommender Systems, ML Model lifecycle, ML Monitoring Best Practices, ML Monitoring Implementation