Vigil: Effective end-to-end monitoring for large-scale recommender systems at Glance

Submitted Jun 30, 2023

Abstract

The success of large-scale recommender systems hinges on their ability to deliver accurate and timely recommendations to a diverse user base. At Glance, we deliver snackable, personalized content to the lock screens of 200M smartphones. In this context, continuous monitoring is paramount: it safeguards data integrity, detects drift, tracks evolving user preferences, minimizes system downtime, and ultimately improves the system’s effectiveness and user satisfaction. This talk explores the critical role of continuous monitoring in our ecosystem. We introduce Vigil, a comprehensive end-to-end monitoring framework and set of practices designed specifically for Glance’s recommender systems. These practices revolve around three key pillars: mitigating developer fatigue, ensuring precise predictions, and establishing a centralized monitoring framework. By adopting them, we have observed an 18% increase in user engagement, a 30% reduction in compute cost, a 26% drop in downtime, and a surge in developer productivity, demonstrated by a 45% decrease in turnaround time.

Impact of Vigil

A centralized system monitoring view has improved developer productivity: it mitigates alert fatigue and has cut the turnaround time to detect and fix issues during on-call operations by 45%.
Monitoring latencies, errors, and resource utilization, combined with practices such as adaptive retraining and the elimination of redundant data and pipelines, has reduced system costs by 30% while improving performance and resource management.
System downtime has decreased by 26%, giving our users a more reliable, uninterrupted service.
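
The abstract does not describe how drift detection or adaptive retraining is implemented in Vigil. As an illustration only, the sketch below shows one common way such a check could work: compute the Population Stability Index (PSI) between a reference sample of a feature and a recent sample, and flag the model for retraining when it crosses a threshold. The function names, the 10-bin histogram, and the 0.2 threshold are assumptions chosen for this example, not details from the talk.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between two samples, using equal-width bins over the reference range."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has zero range")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range fall into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each proportion at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def should_retrain(
    reference: Sequence[float],
    current: Sequence[float],
    threshold: float = 0.2,  # hypothetical cut-off; tune per feature
) -> bool:
    """Flag retraining only when the observed drift exceeds the threshold."""
    return population_stability_index(reference, current) > threshold


if __name__ == "__main__":
    baseline = [i / 100 for i in range(100)]        # roughly uniform on [0, 1)
    shifted = [0.5 + i / 200 for i in range(100)]   # mass moved to the upper half
    print(should_retrain(baseline, baseline))
    print(should_retrain(baseline, shifted))
```

Retraining only when a drift signal fires, rather than on a fixed schedule, is one way a practice like adaptive retraining could contribute to lower compute cost.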

Key takeaways from the talk

Implementing Vigil has led to tangible improvements in key performance metrics, showcasing the value of effective end-to-end monitoring in large-scale recommender systems. Glance’s experience with Vigil highlights the importance of continuous monitoring. The talk offers valuable insights that can be applied to similar large-scale recommender systems, benefiting system performance, user engagement, cost-efficiency, and developer productivity.

Outline of the talk

Introduction to Glance
Challenges in monitoring large-scale recommender systems
Vigil: A comprehensive end-to-end monitoring framework/practices
- Proactive alerting and system monitoring
- Dependency and impact monitoring
- Testing and performance monitoring
Impact of adopting Vigil at Glance
Expressway: A centralized monitoring tool built on the ideas of Vigil
Key Takeaways

Slides Hyperlink - Vigil: Effective end-to-end monitoring for large-scale recommender systems at Glance

Labels

MLOps, Recommender Systems, ML Model lifecycle, ML Monitoring Best Practices, ML Monitoring Implementation

Speakers

Priyansh Saxena, Data Scientist, InMobi Group
Manisha R, Data Scientist, InMobi Group

The Fifth Elephant 2023 Monsoon
