The Fifth Elephant 2023 Monsoon

On AI, industrial applications of ML, and MLOps



Priyansh Saxena


Vigil: Effective end-to-end monitoring for large-scale recommender systems at Glance

Submitted Jun 30, 2023


The success of large-scale recommender systems hinges on their ability to deliver accurate and timely recommendations to a diverse user base. At Glance, we deliver snackable, personalized content to the lock screens of 200M smartphones. In this context, continuous monitoring is paramount: it safeguards data integrity, detects drift, tracks evolving user preferences, minimizes system downtime, and ultimately improves the system’s effectiveness and user satisfaction. This talk explores the critical role of continuous monitoring in our ecosystem. We introduce Vigil, a comprehensive end-to-end monitoring framework and set of practices designed specifically for Glance’s recommender systems. These practices revolve around three key pillars: mitigating developer fatigue, ensuring precise predictions, and establishing a centralized monitoring framework. By adopting them, we have observed an 18% increase in user engagement, a 30% reduction in compute cost, a 26% drop in downtime, and a surge in developer productivity, demonstrated by a 45% decrease in turnaround time.
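
To make the idea of drift detection concrete, here is a minimal sketch of one common check, the Population Stability Index (PSI), which compares a feature's distribution in a reference window against a recent window. This illustrates the general technique only; it is not taken from Vigil. The function name, the bin count, and the 0.2 alert threshold (a widely used rule of thumb) are all assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Return the PSI between a reference sample and a recent sample.

    Bin edges are equal-width over the range of the reference sample.
    Values in `actual` that fall outside that range are clamped into the
    first or last bin. Empty bins are floored at `eps` to keep the
    logarithm finite.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has zero range")
    width = (hi - lo) / bins

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical usage: flag a feature for review when PSI exceeds 0.2.
reference = [i / 100 for i in range(100)]        # roughly uniform on [0, 1)
recent = [0.5 + i / 200 for i in range(100)]     # mass shifted to [0.5, 1)
drifted = population_stability_index(reference, recent) > 0.2
```

A check like this could run per feature on a schedule, with the resulting score feeding an alert or a retraining decision; which features and thresholds matter is system-specific.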

Impact of Vigil

  • A centralized system monitoring view has improved developer productivity: it mitigates alert fatigue and has cut the turnaround time to detect and fix issues during on-call operations by 45%.
  • Monitoring latencies, errors, and resource utilization, combined with adaptive retraining and the elimination of redundant data and pipelines, has reduced system costs by 30%.
  • System downtime has decreased by 26%, giving users a more reliable, uninterrupted service.
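
One simple way to reduce alert fatigue is to suppress repeat notifications for the same failing check within a cooldown window. The sketch below shows that idea in isolation; it is not Vigil's implementation, and the class name, the `(service, check)` key, and the default cooldown are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class AlertDeduplicator:
    """Suppress repeat alerts for the same (service, check) pair.

    An alert is forwarded only if no alert with the same key was
    forwarded during the preceding `cooldown_seconds`.
    """

    cooldown_seconds: float = 300.0
    _last_sent: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def should_notify(self, service: str, check: str, timestamp: float) -> bool:
        key = (service, check)
        last = self._last_sent.get(key)
        if last is not None and timestamp - last < self.cooldown_seconds:
            return False  # still inside the cooldown window: suppress
        self._last_sent[key] = timestamp
        return True


# Hypothetical usage with timestamps in seconds.
dedup = AlertDeduplicator(cooldown_seconds=60)
first = dedup.should_notify("ranker", "latency", 0.0)    # forwarded
repeat = dedup.should_notify("ranker", "latency", 30.0)  # suppressed
```

The window is measured from the last forwarded alert, so a check that keeps failing still pages again once the cooldown expires rather than going silent.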

Key takeaways from the talk

Implementing Vigil has led to tangible improvements in key performance metrics, showing the value of effective end-to-end monitoring in large-scale recommender systems. Glance’s experience with Vigil highlights the importance of continuous monitoring. The talk offers insights that can be applied to similar large-scale recommender systems to improve system performance, user engagement, cost-efficiency, and developer productivity.

Outline of the talk

  • Introduction to Glance
  • Challenges in monitoring large-scale recommender systems
  • Vigil: A comprehensive end-to-end monitoring framework and set of practices
    • Proactive alerting and system monitoring
    • Dependency and impact monitoring
    • Testing and performance monitoring
  • Impact of adopting Vigil at Glance
  • Expressway: A centralized monitoring tool built on the ideas of Vigil
  • Key takeaways

Slides Hyperlink - Vigil: Effective end-to-end monitoring for large-scale recommender systems at Glance


MLOps, Recommender Systems, ML Model lifecycle, ML Monitoring Best Practices, ML Monitoring Implementation



